MDForge: Agentic Molecular Dynamics Pipeline Design under Sparse Simulator Feedback

arXiv cs.AI 06/12/26, 04:00 AM Papers
Summary
MDForge is an LLM agent that automates the design of molecular dynamics pipelines for host-guest binding free-energy calculations, achieving human-expert competitive results and discovering a novel high-affinity binder.
arXiv:2606.12916v1 Announce Type: new Abstract: Molecular dynamics (MD) is the canonical in-silico method for atomistic molecular science, simulating molecular behavior from first-principle physics. Designing an MD pipeline for a new system requires substantial expert knowledge: running it on even one molecule is expensive, ruling out trial-and-error. We automate this expert pipeline-design process with an LLM agent. Unlike existing MD agents that orchestrate a predefined tool set, we treat pipeline design as open-ended code generation in which the agent's behavior is reshaped online by verbal reward. Specifically, we build MDForge, an LLM agent whose in-context update rule densifies the sparse reward via a multi-agent debate among physics experts. On three SAMPL host-guest binding free-energy benchmarks, MDForge automatically designs MD pipelines competitive with human experts. Deployed on a library of unseen candidate guests, its CB[7] pipeline discovers a novel binder that wet-lab competition NMR confirms is a high-affinity, picomolar CB[7] binder. Our data and code are available at https://github.com/Zehong-Wang/MDForge.
Cached at: 06/12/26, 08:54 AM
# Agentic Molecular Dynamics Pipeline Design under Sparse Simulator Feedback
Source: [https://arxiv.org/html/2606.12916](https://arxiv.org/html/2606.12916)
Zehong Wang¹, Yijun Ma¹, Connor R. Schmidt¹, Tianyi Ma¹, Weixiang Sun¹, Ziming Li², Xiaoguang Guo², Chuxu Zhang², Matthew J. Webber¹, Yanfang Ye¹,†

¹University of Notre Dame, ²University of Connecticut. †Corresponding author. {zwang43, yye7}@nd.edu

###### Abstract

Molecular dynamics (MD) is the canonical in-silico method for atomistic molecular science, simulating molecular behavior from first-principle physics. Designing an MD pipeline for a new system requires substantial expert knowledge: running it on even one molecule is expensive, ruling out trial-and-error. We automate this expert pipeline-design process with an LLM agent. Unlike existing MD agents that orchestrate a predefined tool set, we treat pipeline design as open-ended code generation in which the agent’s behavior is reshaped online by verbal reward. Specifically, we build MDForge, an LLM agent whose in-context update rule densifies the sparse reward via a multi-agent debate among physics experts. On three SAMPL host–guest binding free-energy benchmarks (CB[7], OAH, CBClip), MDForge automatically designs MD pipelines competitive with human experts. Deployed on a library of unseen candidate guests, its CB[7] pipeline discovers a novel binder that wet-lab competition NMR confirms is a high-affinity, picomolar CB[7] binder ($K_a \approx 8\times 10^{12}$ M$^{-1}$). Our data and code are available at [https://github.com/Zehong-Wang/MDForge](https://github.com/Zehong-Wang/MDForge).


## 1 Introduction

![Refer to caption](https://arxiv.org/html/2606.12916v1/x1.png)

Figure 1: Three paradigms for MD pipeline design. (a) A human expert hand-picks each stage and iteratively revises. (b) Existing LLM agents for MD design call a fixed MD toolbox with no run-time feedback. (c) MDForge emits the pipeline as code and refines it via PRISM, a multi-expert debate over per-stage diagnostics that returns a typed critique.

Molecular dynamics (MD) simulation has long been the canonical in-silico method for studying molecular behavior at atomistic resolution (Karplus and McCammon, [2002](https://arxiv.org/html/2606.12916#bib.bib73); Hollingsworth and Dror, [2018](https://arxiv.org/html/2606.12916#bib.bib74)). By integrating first-principle equations of motion, MD produces atomistic trajectories from which a researcher can understand binding affinities, conformational ensembles, reaction pathways, and material properties. Several of these quantities are accessible to wet-lab measurement only at considerable expense and time; others, such as the transient conformational states populated during an enzymatic catalytic cycle, are not directly observable at all. These properties have made MD a mainstay of biology, drug discovery, and chemistry for decades (Behler, [2021](https://arxiv.org/html/2606.12916#bib.bib75); Unke et al., [2021](https://arxiv.org/html/2606.12916#bib.bib76)).

Designing an MD pipeline for a new molecular system typically requires the work of trained scientists, and the throughput of new pipelines is correspondingly modest (Mey et al., [2020](https://arxiv.org/html/2606.12916#bib.bib77)). A free-energy calculation illustrates this: it involves joint specification of a binding-pose hypothesis, force-field parameterization, equilibration schedule, sampling protocol, restraints, and an estimator. These choices interact non-trivially and few are universal: a pipeline tuned for one system class rarely transfers, because the dominant physics differs across system types (Mobley and Gilson, [2017](https://arxiv.org/html/2606.12916#bib.bib82); Schindler et al., [2020](https://arxiv.org/html/2606.12916#bib.bib83)). The recent surge of AI-driven molecular predictors does not remove this need. A data-driven predictor outputs a target property value (e.g., a binding affinity) (Merchant et al., [2023b](https://arxiv.org/html/2606.12916#bib.bib85); Ross et al., [2022](https://arxiv.org/html/2606.12916#bib.bib86); Wang et al., [2026a](https://arxiv.org/html/2606.12916#bib.bib118); Ye et al., [2026](https://arxiv.org/html/2606.12916#bib.bib119)) but does not produce the atomistic trajectory MD does, so it cannot supply the mechanistic account that physics-based simulation is invoked for in the first place. Its applicability is also bounded by the chemical space it was trained on: the model has no foothold on a system class for which no large labeled corpus exists (Wu et al., [2018](https://arxiv.org/html/2606.12916#bib.bib78); Yang et al., [2019](https://arxiv.org/html/2606.12916#bib.bib79)), and on inputs outside its training distribution its predictions degrade silently (Bender and Cortés-Ciriano, [2021](https://arxiv.org/html/2606.12916#bib.bib81); van Tilborg et al., [2022](https://arxiv.org/html/2606.12916#bib.bib80)). MD therefore remains indispensable for mechanistic understanding (Bottaro and Lindorff-Larsen, [2018](https://arxiv.org/html/2606.12916#bib.bib84)), but designing its pipeline for a new system is an expert task.

In this work, we aim to design an agentic AI system that automates MD pipeline design by replicating the work of a trained expert. Faced with a new molecular system, the expert (Cournia et al., [2017](https://arxiv.org/html/2606.12916#bib.bib62)) first inspects its chemistry, charges, rigidity, and binding mode, and these observations dictate every downstream choice: the force-field family, the equilibration schedule, the sampling protocol, the restraint scheme, and the estimator. The pipeline is then run; the expert reads the diagnostics it returns (divergence traces, free-energy convergence plots, restraint-release artifacts), identifies which subsystem misbehaved, and revises the pipeline for the next trial. Several recent LLM agents target this automation. For example, MDCrow (Campbell et al., [2025](https://arxiv.org/html/2606.12916#bib.bib43)) wraps a general-purpose MD toolset (force-field setup, simulation, trajectory analysis) in LangChain-style tool calls (Yao et al., [2023](https://arxiv.org/html/2606.12916#bib.bib9)); MDAgent (Ma et al., [2026b](https://arxiv.org/html/2606.12916#bib.bib47)) extends the pattern with a memory module that reuses parameter choices and analytical logic from prior tasks (Zhao et al., [2024](https://arxiv.org/html/2606.12916#bib.bib12); Chen et al., [2024](https://arxiv.org/html/2606.12916#bib.bib20)); DynaMate (Guilbert et al., [2025](https://arxiv.org/html/2606.12916#bib.bib41)) ports the same tool-calling pattern to binding free-energy workflows. Yet none of these systems does exactly what an MD expert does. Their tool-calling resembles the expert’s selection of pipeline pieces, but only from a fixed toolbox, which narrows what the expert could otherwise compose. Likewise, none of them uses the feedback the workflow returns, yet that feedback, though sparse, is what the expert depends on to refine the pipeline.

To tackle both gaps, we propose MDForge, an LLM-driven agent that frames MD pipeline design as open-ended code generation (Wang et al., [2024a](https://arxiv.org/html/2606.12916#bib.bib11)) under verbal reinforcement learning (Shinn et al., [2023](https://arxiv.org/html/2606.12916#bib.bib8)). Code generation matches the expert’s actual action space, which is not a preregistered toolbox but whatever the new system asks for. Verbal RL matches the expert’s iteration habit, where each trial’s diagnostics drive the next pipeline. This framing surfaces the central technical challenge of the paper: building an agent that can learn from very few feedback signals. Each signal arrives only after a full MD workflow run, whose GPU-hour cost confines each task to a small trial budget, far too small for the many iterative behavior updates that typical approaches rely on (Wang et al., [2026b](https://arxiv.org/html/2606.12916#bib.bib117); Chen et al., [2026](https://arxiv.org/html/2606.12916#bib.bib40); Gupta et al., [2025](https://arxiv.org/html/2606.12916#bib.bib1)).

At the heart of MDForge is Process-Reward Interpretation via Subsystem Mediation (PRISM), an in-context update rule that turns the handful of terminal rewards into a dense, typed learning signal along two axes. First, PRISM exploits the staged nature of an MD pipeline (preparation, equilibration, production sampling, analysis): it harvests per-stage diagnostics from the simulator’s intermediate outputs, so the agent receives feedback at every stage boundary rather than only at the end of the run (Lightman et al., [2024](https://arxiv.org/html/2606.12916#bib.bib21); Uesato et al., [2022](https://arxiv.org/html/2606.12916#bib.bib26); Wang et al., [2024b](https://arxiv.org/html/2606.12916#bib.bib28)). Second, PRISM launches a panel of physics experts (force field, sampling, analysis) to debate each diagnostic (Du et al., [2024](https://arxiv.org/html/2606.12916#bib.bib15)) and produce a typed, subsystem-attributable critique that reshapes MDForge’s behavior, surfacing the kind of physical interpretation only experts can provide. Empirically, MDForge produces pipelines comparable to expert hand-designs on three SAMPL host–guest binding free-energy benchmarks (Muddana et al., [2014](https://arxiv.org/html/2606.12916#bib.bib4); Yin et al., [2017](https://arxiv.org/html/2606.12916#bib.bib87)) (CB[7], OAH, and CBClip). The best AI-designed CB[7] pipeline, applied to a library of unseen candidate guests, discovers a novel binder confirmed by wet-lab competition NMR to be a high-affinity (picomolar) CB[7] binder ($K_a \approx 8\times 10^{12}$ M$^{-1}$).

## 2 Related Work

Molecular dynamics. MDForge sits atop an established physics-based MD stack rather than competing with any of its parts: alchemical FEP/TI with BAR/MBAR estimators (Bennett, [1976](https://arxiv.org/html/2606.12916#bib.bib95); Shirts and Chodera, [2008](https://arxiv.org/html/2606.12916#bib.bib96); Mey et al., [2020](https://arxiv.org/html/2606.12916#bib.bib77)), mature simulation engines (Eastman et al., [2017](https://arxiv.org/html/2606.12916#bib.bib100); Abraham et al., [2015](https://arxiv.org/html/2606.12916#bib.bib101); Case et al., [2023](https://arxiv.org/html/2606.12916#bib.bib102)), and standard biomolecular force-field families. Recent neural work replaces individual slices of this stack with learned components: ML force fields (Behler, [2021](https://arxiv.org/html/2606.12916#bib.bib75); Unke et al., [2021](https://arxiv.org/html/2606.12916#bib.bib76)), structure predictors (Jumper et al., [2021](https://arxiv.org/html/2606.12916#bib.bib3)), and equilibrium samplers (Noé et al., [2019](https://arxiv.org/html/2606.12916#bib.bib103)). MDForge automates the *workflow* itself as executable code, so the design space is that of program synthesis over the existing toolset rather than the parameter space of a fixed pipeline template.

Autonomous science agents. LLM-driven scientific agents have integrated literature search, hypothesis proposal, and code synthesis into runnable discovery pipelines across chemistry, materials, and biology (Boiko et al., [2023](https://arxiv.org/html/2606.12916#bib.bib37); Bran et al., [2024](https://arxiv.org/html/2606.12916#bib.bib52); Lu et al., [2024](https://arxiv.org/html/2606.12916#bib.bib39); Merchant et al., [2023a](https://arxiv.org/html/2606.12916#bib.bib53)). MD-specific agents have converged on a *tool-calling* pattern that orchestrates a fixed library of engines, force fields, and analysis routines under LLM control (Campbell et al., [2025](https://arxiv.org/html/2606.12916#bib.bib43); Ma et al., [2026b](https://arxiv.org/html/2606.12916#bib.bib47); Guilbert et al., [2025](https://arxiv.org/html/2606.12916#bib.bib41); Chandrasekhar and Farimani, [2025](https://arxiv.org/html/2606.12916#bib.bib44); Shi et al., [2025](https://arxiv.org/html/2606.12916#bib.bib45)). MDForge instead treats MD pipeline design as open-ended code generation in the lineage of program-synthesis agents (Wang et al., [2024a](https://arxiv.org/html/2606.12916#bib.bib11); Romera-Paredes et al., [2024](https://arxiv.org/html/2606.12916#bib.bib38)), operating in a regime where the supervisory signal is both sparse (one terminal reward per trial) and expensive (GPU-hours of MD execution).

See Appendix [A](https://arxiv.org/html/2606.12916#A1) for extended discussion.

## 3 Problem Setup

Task. Given a target system class $\mathcal{T}=\{s_1,\ldots,s_M\}$ of related molecular systems with experimental references $\{y_{\exp}(s_m)\}$ for some target observable $y$ (e.g., binding free energy), the agent emits an executable MD pipeline $\pi\in\Pi$ that, applied across $\mathcal{T}$, minimizes the mean per-system prediction error $\mathcal{L}(\pi)=\tfrac{1}{M}\sum_{m}|\hat{y}_{\pi}(s_m)-y_{\exp}(s_m)|$.
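The objective above is a mean absolute error over the systems in the class. A minimal sketch follows, assuming predictions and references are held in dictionaries keyed by a system identifier; the function names and the numbers are illustrative, not taken from the paper.

```python
from statistics import fmean


def pipeline_loss(predicted, experimental):
    """Mean absolute error L(pi) over the systems s_1..s_M of a class."""
    if predicted.keys() != experimental.keys():
        raise ValueError("predicted and experimental must cover the same systems")
    return fmean(abs(predicted[s] - experimental[s]) for s in experimental)


def terminal_reward(predicted, experimental):
    """Terminal reward r* = -L(pi), available only after a full pipeline run."""
    return -pipeline_loss(predicted, experimental)


# Hypothetical free energies in kcal/mol, for illustration only.
y_hat = {"g1": -10.0, "g2": -7.5, "g3": -12.0}
y_exp = {"g1": -9.0, "g2": -8.0, "g3": -12.5}
loss = pipeline_loss(y_hat, y_exp)
reward = terminal_reward(y_hat, y_exp)
```

One scalar per pipeline is all this yields, which is why the rest of the setup treats the reward as sparse.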

POMDP. We cast the design loop as $\mathcal{M}=(\mathcal{S},\mathcal{A},\mathcal{O},T,R,\gamma)$: the state space $\mathcal{S}$ is the design history $(\pi_{1:t},D_{\pi_{1:t}})$ of pipelines tried and their stage-level diagnostics; the action space $\mathcal{A}=\Pi$ is the open-ended space of executable programs that emit an MD workflow over four canonical stages (Prep, Equilibration, Production, Analysis); observations $\mathcal{O}\subseteq\mathcal{V}$ are the natural-language documents the simulator returns; transitions $T$ are deterministic, governed by physics and the toolchain; the reward $R$ is realized only at the horizon as $r^{*}_{\pi}=-\mathcal{L}(\pi)$; and $\gamma=1$. The reward is therefore both *sparse* (one event per pipeline) and *expensive* (a GPU-hour production run per trial). We address this regime with verbal RL.

###### Definition 1 (Verbal RL).

With $\mathcal{V}$ the space of natural-language strings, a POMDP is *verbal* if $\mathcal{A},\mathcal{O}\subseteq\mathcal{V}$ and the policy is an LLM with frozen parameters $\theta$ acting on a textual context $\mathcal{C}_t\in\mathcal{V}$,

$$
\pi_{t+1} \,\sim\, \mathrm{LLM}_{\theta}(\,\cdot\,\mid\,\mathcal{C}_{t+1}), \tag{1}
$$

$$
\mathcal{C}_{t+1} \,=\, \mathrm{Update}(\mathcal{C}_t,\,\pi_t,\,o_t,\,r_t), \tag{2}
$$

where $\mathrm{Update}$ is an LLM call folding each trial outcome $(o_t,r_t)$ back into the context.
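The two equations of Definition 1 describe a loop in which only the context changes between trials. The sketch below is a schematic of that loop, not the authors' implementation: `propose`, `run`, and `update` are hypothetical callables standing in for the frozen LLM, the simulator, and the Update call, and the toy stubs exist only to make the example executable.

```python
from typing import Callable, List, Tuple

Context = str
Pipeline = str


def verbal_rl(
    propose: Callable[[Context], Pipeline],        # stands in for LLM_theta(. | C)
    run: Callable[[Pipeline], Tuple[str, float]],  # simulator: returns (o_t, r_t)
    update: Callable[[Context, Pipeline, str, float], Context],  # Update in Eq. (2)
    context: Context,
    n_trials: int,
) -> List[Tuple[Pipeline, float]]:
    """Run n_trials of propose -> execute -> context rewrite; no weights change."""
    history = []
    for _ in range(n_trials):
        pipeline = propose(context)                # Eq. (1)
        observation, reward = run(pipeline)        # expensive step in practice
        context = update(context, pipeline, observation, reward)  # Eq. (2)
        history.append((pipeline, reward))
    return history


# Toy stubs (hypothetical): each critique in the context yields a better version.
def toy_propose(ctx: Context) -> Pipeline:
    return f"pipeline_v{ctx.count('[critique]')}"


def toy_run(pipeline: Pipeline) -> Tuple[str, float]:
    version = int(pipeline.rsplit("v", 1)[1])
    return f"{pipeline} finished", -1.0 / (1 + version)


def toy_update(ctx: Context, pipeline: Pipeline, obs: str, reward: float) -> Context:
    return ctx + f"\n[critique] {obs}; reward={reward:.2f}"


history = verbal_rl(toy_propose, toy_run, toy_update, "task: CB[7] binding", 3)
```

Because `run` is the GPU-hour step, the number of loop iterations is small in practice, so the quality of what `update` writes into the context matters more than the number of trials.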

## 4 MDForge

We present MDForge, an LLM agent that instantiates verbal RL for automatic molecular dynamics workflow design. The framework is shown in Figure [2](https://arxiv.org/html/2606.12916#S4.F2), with the full protocol in Appendix [B](https://arxiv.org/html/2606.12916#A2).

![Refer to caption](https://arxiv.org/html/2606.12916v1/x2.png)

Figure 2: Overview of MDForge. (a) Automating MD design for binding affinity prediction, instantiated on the SAMPL CB[7], OAH, and CBClip testbeds. (b) A Code agent reads the context bundle $\mathcal{C}_t=\{T,\pi_t,K_t,H_t\}$ (task, current pipeline as typed code, critique set, and headline-metric trial history) and emits an executable pipeline through a sandbox. Execution proceeds through $K=4$ canonical stages (Preparation, Equilibration, Production, Analysis), yielding per-stage diagnostics $D_{\pi}$. A PRISM panel reviews $\pi$ pre- and post-execution to emit typed critiques $c_{\text{pre}}, c_{\text{post}}$, which feed back into $\mathcal{C}_{t+1}$ as in-context fast-weight updates. (c) $J=3$ specialists (Force-Field, Sampling, Analysis) with reputations $\rho_j$ first produce independent opinions (Round 1), then revise under cross-visibility (Round 2); a reputation-weighted aggregator $\mathcal{A}_{\rho}$ emits a single typed critique (subsystem + action).

### 4.1 Design Rationale

Verbal RL reduces trial-to-trial learning to the context rewrite $\mathcal{C}_t\mapsto\mathcal{C}_{t+1}$, equivalently a fast-weight update of the LLM’s induced state without touching its parameters (Schmidhuber, [1992](https://arxiv.org/html/2606.12916#bib.bib5); Ba et al., [2016](https://arxiv.org/html/2606.12916#bib.bib6); Schlag et al., [2021](https://arxiv.org/html/2606.12916#bib.bib7)). Under an expensive reward (each trial requires GPU-hours of MD execution before producing $r^{*}_{\pi}$), the only lever is to enrich the information each reward event carries. Two general approaches densify a sparse signal: (i) split the reward across the pipeline so each part receives its own signal (Lightman et al., [2024](https://arxiv.org/html/2606.12916#bib.bib21)), and (ii) attach explanatory text to each signal value (Shinn et al., [2023](https://arxiv.org/html/2606.12916#bib.bib8)). An MD pipeline supplies a natural instantiation of each: it is staged along an execution sequence, yielding per-stage diagnostics, and it is naturally analyzed by a multi-agent panel of physics specialists, yielding critiques.

### 4.2 PRISM: Producing the Dense Signal

PRISM (Process-Reward Interpretation via Subsystem Mediation) is the densification machinery of MDForge: it converts the single terminal scalar $r^{*}_{\pi}$ into dense signals,

$$
r^{*}_{\pi}\;\xrightarrow{\;\text{PRISM}\;}\;\bigl(D_{\pi},\,c_{\text{pre}},\,c_{\text{post}}\bigr), \tag{3}
$$

where $D_{\pi}$ is a $K$-tuple of per-stage physics diagnostics extracted from the simulator ($K=4$ canonical stages), and $c_{\text{pre}}, c_{\text{post}}$ are typed pre- and post-execution critiques aggregated from a panel of $J=3$ physics specialists.

Per-stage physics diagnostics. The $K=4$ stages (Prep, Equilibration, Production sampling, Analysis) each run against a well-defined physical objective and expose interpretable diagnostics at their boundary. We attach to each stage a physics-grounded structured diagnostic: a typed record of canonical observables (e.g., force-field self-consistency at Prep, ergodicity and PME accuracy at Production, free-energy convergence at Analysis), extracted directly from the simulator’s output rather than synthesized by an LLM. Concatenated across stages, these form $D_{\pi}$. Only the Production stage incurs substantial GPU-hour cost.
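One way to picture the typed record is a small immutable structure per stage, concatenated in stage order into a 4-tuple. This is a sketch under assumptions: the field names, the status vocabulary, and every observable key except `mbar_overlap_above_0p03` (which appears in the Figure 3 case study) are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple, Union

STAGES = ("prep", "equilibration", "production", "analysis")  # K = 4


@dataclass(frozen=True)
class StageDiagnostic:
    stage: str
    status: str  # hypothetical vocabulary: "ok", "warn", "fail"
    observables: Dict[str, Union[float, bool]]


def assemble_diagnostics(records: Iterable[StageDiagnostic]) -> Tuple[StageDiagnostic, ...]:
    """Order per-stage records into the K-tuple D_pi, rejecting gaps or unknown stages."""
    by_stage = {r.stage: r for r in records}
    unknown = set(by_stage) - set(STAGES)
    missing = [s for s in STAGES if s not in by_stage]
    if unknown or missing:
        raise ValueError(f"unknown stages {sorted(unknown)}, missing stages {missing}")
    return tuple(by_stage[s] for s in STAGES)


# Illustrative values only; real records would be parsed from simulator output.
d_pi = assemble_diagnostics([
    StageDiagnostic("analysis", "fail", {"mbar_overlap_above_0p03": False}),
    StageDiagnostic("prep", "ok", {"net_charge_matches": True}),
    StageDiagnostic("production", "ok", {"pme_rel_error": 1e-5}),
    StageDiagnostic("equilibration", "ok", {"density_drift": 0.002}),
])
```

Keeping the record typed rather than free text is what lets a later critique name a stage and a field instead of describing a failure in general terms.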

Multi-agent debate over physics subsystems. $D_{\pi}$ is not yet actionable: a single measurement typically reflects several superimposed causes (force-field error, integrator instability, restraint misplacement, unconverged estimator) that no generic critic can disentangle. We delegate interpretation to the panel of $J=3$ specialist LLM agents, holding fixed, non-overlapping jurisdictions over canonical MD subsystems: *Force Field*, *Sampling*, and *Analysis*. They deliberate in two rounds with cross-visibility (Du et al., [2024](https://arxiv.org/html/2606.12916#bib.bib15)), and an aggregator $\mathcal{A}_{\rho}$ collapses their opinions into a single typed critique, weighted by per-expert reputations $\rho=(\rho_1,\ldots,\rho_J)$ so the panel can downweight historically miscalibrated specialists rather than averaging them equally with reliable ones; the update rule for $\rho$ is given in §[4.3](https://arxiv.org/html/2606.12916#S4.SS3). Before the panel sees $\pi$, a tool-using Engineer agent debugs engineering faults (uncaught exceptions, missing files, mis-called APIs) in a sandbox without altering methodological choices, reserving panel deliberation for failures admitting physical attribution.

The panel is invoked at two points per trial: pre-execution it reviews $\pi$ and produces $c_{\text{pre}}$, a cheap screen the multi-agent system can act on before incurring the cost of a full run; post-execution it reviews the pipeline’s execution results $B_{\pi}$ (predicted free energies with the corresponding accuracy and ranking metrics), producing

$$
c_{\text{post}}\;=\;\mathcal{A}_{\rho}\bigl(\,\pi,\,B_{\pi}\,\bigr). \tag{4}
$$

Together with $D_{\pi}$, the pair $(c_{\text{pre}},c_{\text{post}})$ completes the PRISM densification map of Equation ([3](https://arxiv.org/html/2606.12916#S4.E3)). The panel is advisory: only hard signals (Layer-1 rejection, divergence, timeout) gate execution.
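The text says the aggregator weights specialist opinions by reputation and emits a single typed critique (subsystem plus action), but this section does not give the aggregation rule. The sketch below therefore assumes the simplest candidate, a reputation-weighted vote over Round 2 opinions; the `Opinion` fields, the expert keys, and the example actions are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Opinion:
    expert: str     # e.g. "force_field", "sampling", "analysis"
    subsystem: str  # subsystem the expert blames
    action: str     # edit the expert recommends


def aggregate(opinions: List[Opinion], reputation: Dict[str, float]) -> Dict[str, object]:
    """Assumed rule: pick the (subsystem, action) pair with the most reputation mass."""
    scores = defaultdict(float)
    for op in opinions:
        scores[(op.subsystem, op.action)] += reputation.get(op.expert, 0.0)
    total = sum(scores.values())
    if not opinions or total <= 0.0:
        raise ValueError("need at least one opinion from an expert with positive reputation")
    subsystem, action = max(scores, key=scores.get)
    return {
        "subsystem": subsystem,
        "action": action,
        "support": scores[(subsystem, action)] / total,
    }


# Illustrative Round 2 opinions and reputations.
round2 = [
    Opinion("force_field", "force_field", "recheck guest partial charges"),
    Opinion("sampling", "analysis", "add an MBAR overlap guard"),
    Opinion("analysis", "analysis", "add an MBAR overlap guard"),
]
rho = {"force_field": 1.0, "sampling": 2.0, "analysis": 2.0}
critique = aggregate(round2, rho)
```

Under this assumed rule, a specialist with low reputation can still be heard, but cannot outvote two better-calibrated colleagues who agree.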

Table 1: Results on SAMPL host–guest binding benchmarks: CB[7] ($n_{\mathrm{held}}=10$), OAH ($n_{\mathrm{held}}=5$), CBClip ($n_{\mathrm{held}}=6$), with 4 training guests selected at experimental-$\Delta G$ quintile positions. We report $R^2$, Spearman $\rho$, and Kendall $\tau$ against experimental $\Delta G$, for the best of $N=5$ successful trials per host (selected by training-set $\tau$). *Runnable* summarizes how often a method produces an executable pipeline across the $N=5$ trials: $\checkmark$ = all 5, = 1–2, $\times$ = none. “–” marks methods with no runnable trial on any host.

### 4.3 Code Agent Update and Reputation Loop

Code agent update. Before the first trial, the panel holds a one-off design discussion over the task description $T$ (with web-search access), and its aggregated recommendations seed the Code agent’s initial proposal. After each subsequent trial the Code agent regenerates the pipeline conditioned on the context bundle

$$
\mathcal{C}_t\;=\;\bigl\{\,T,\;\pi_t,\;K_t,\;H_t\,\bigr\}, \tag{5}
$$

where $K_t=\{c_{l1,t},\,c_{\text{pre},t},\,c_{\text{post},t}\}$ is the critique set for trial $t$ (Layer-1 static check, pre-execution panel, post-execution panel on the benchmark) and $H_t$ is a compact trial history over all earlier trials (per-stage statuses and headline benchmark metrics). The per-stage diagnostic $D_{\pi_t}$ and terminal reward $r^{*}_{\pi_t}$ enter via the narrative text of $c_{\text{post},t}$ and the headline metrics in $H_t$; the reputation $\rho_t$ does not enter the Code agent’s view directly and only shapes the aggregated critique through $\mathcal{A}_{\rho}$. The prior pipeline $\pi_t$ is included in $\mathcal{C}_t$ so revisions can remain localized edits when feasible rather than wholesale rewrites; the full edit protocol and prompt template are in Appendix [B](https://arxiv.org/html/2606.12916#A2).

Reputation loop. A slow per-task loop maintains $\rho_t$ by per-expert agreement between $c_{\text{pre},t}$ and $c_{\text{post},t}$: experts whose pre-trial calls are validated by post-execution evidence accumulate weight within the task. The update rule and its in-task convergence are in Appendix [B](https://arxiv.org/html/2606.12916#A2).
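The actual update rule lives in Appendix B and is not reproduced in this section. Purely to illustrate the stated behavior (validated experts gain weight within a task), the sketch below assumes an exponential moving average of a binary agreement signal followed by renormalization; the rule, the `rate` parameter, and the expert keys are all assumptions, not the paper's method.

```python
from typing import Dict


def update_reputation(
    rho: Dict[str, float],
    agreed: Dict[str, bool],
    rate: float = 0.3,  # hypothetical step size; a small value keeps the loop slow
) -> Dict[str, float]:
    """Assumed rule: EMA toward 1 if the expert's pre-trial call was validated, else toward 0."""
    if not 0.0 < rate < 1.0:
        raise ValueError("rate must lie strictly between 0 and 1")
    raw = {j: (1.0 - rate) * rho[j] + rate * (1.0 if agreed[j] else 0.0) for j in rho}
    total = sum(raw.values())
    if total == 0.0:  # degenerate case: fall back to uniform weights
        return {j: 1.0 / len(raw) for j in raw}
    return {j: value / total for j, value in raw.items()}


# Start uniform; suppose the force-field expert's pre-trial call was not validated.
rho_t = {"force_field": 1 / 3, "sampling": 1 / 3, "analysis": 1 / 3}
rho_next = update_reputation(
    rho_t, {"force_field": False, "sampling": True, "analysis": True}
)
```

Whatever the real rule is, the property worth preserving is the one shown here: reputation moves only on post-execution evidence, so it cannot be inflated by confident pre-trial opinions alone.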

## 5 Experiments

![Refer to caption](https://arxiv.org/html/2606.12916v1/x3.png)

Figure 3: PRISM mechanism on CB[7]. (a) Per-trial held-out Kendall $\tau$ over $N=5$ trials (CB[7] row of Table [1](https://arxiv.org/html/2606.12916#S4.T1)). MDForge improves monotonically to $\tau=0.56$; w/o Stage diagnostics peaks at trial 3 then collapses to $0.16$, overshooting without a typed signal. (b) Cumulative spend ranges only \$68 to \$78 across methods, so panel (a)’s gain carries no cost premium. (c) PRISM types the failures: mid-pipeline crashes shrink as components are added, while analysis-stage refusal ($51\%\to 0\%$) appears only with stage diagnostics, since the convergence guard creates that category. (d) One typed signal yields one localized edit, not a shotgun rewrite: the Analysis stage emits `mbar_overlap_above_0p03 = False`, the analysis expert proposes a guard, and the Code agent applies a one-line fix at the named location.

Table 2: Per-stage tool selection. Tool selection on SAMPL4 CB[7] across (i) *pure expert*, the canonical pAPRika APR pipeline (Slochower et al., [2019](https://arxiv.org/html/2606.12916#bib.bib114)); (ii) *AI + non-expert*, a chemistry non-expert’s pipeline assembled with LLM coding assistance; (iii) *pure AI*, MDForge autonomous. *Navy bold italics* mark choices that depart from the expert default. MDForge stays in the expert’s family (GAFF2/AM1-BCC, APR, MBAR) with only reliability-flavored engineering deviations; the AI + non-expert pipeline diverges (z-PMF umbrella, no MBAR guards). Performance reports rank metrics ($\rho$, $\tau$, $R^2$), which wash out force-field-specific systematic biases on absolute $\Delta G$.

### 5.1 Setup

Task. We test MDForge on the SAMPL host–guest binding free-energy challenges (Muddana et al., [2014](https://arxiv.org/html/2606.12916#bib.bib4)), a widely used MD benchmark and the standard tractable proxy for the protein–ligand binding problem that drives drug discovery. A rigid macrocyclic host plays the role of the protein pocket, and the task is to compute the binding free energy $\Delta\hat{G}$ of a small-molecule guest against a known experimental reference. The host–guest setting preserves the thermodynamic machinery of protein–ligand binding (water displacement, ion solvation, anharmonic guest sampling) while removing protein flexibility, so accuracy here is a necessary precondition for the full-protein setting and a fair stress test of an MD design agent. We evaluate on three hosts of distinct chemistry: *CB[7]* (SAMPL4), *OAH* (SAMPL4), and *CBClip* (SAMPL5). For each host the guests are split into a 4-guest training set (visible to the verbal RL feedback) and the remainder as a held-out test set. The docked pose for each guest is supplied; the experimental reference $\Delta G_{\exp}$ is held aside from the agent. We use a cheap-MD configuration: each guest evaluation runs in $\approx 2$ GPU-hours on a single A40 node.

Methods. Baselines form a capability ladder organized by the type of feedback available during pipeline construction: (1) *No feedback*: one-pass code generation with no critique and no execution signal; (2) *LLM critic* (Madaan et al., [2023](https://arxiv.org/html/2606.12916#bib.bib10)): an auxiliary LLM reviews and rewrites the draft, but no code is executed; (3) *Step-level feedback* (Yao et al., [2023](https://arxiv.org/html/2606.12916#bib.bib9)): tool calls return intermediate results during reasoning, enabling partial in-trial recovery but no memory across trials; (4) *Trial-level feedback* (Shinn et al., [2023](https://arxiv.org/html/2606.12916#bib.bib8)): a natural-language summary of each completed trial’s outcome conditions the next trial; and (5) *MDForge*: trial-level feedback augmented with PRISM (stage diagnostics and multi-expert debate). Each method runs until $N=5$ successful trials accumulate per host.

### 5.2 Main Results

Table [1](https://arxiv.org/html/2606.12916#S4.T1) separates two questions: *can the agent produce a runnable MD pipeline*, and *how much ranking signal does it then recover*? The first already filters most of the ladder: the No-feedback and LLM-critic baselines fail to produce runnable code on every host (0/5), Step-level feedback succeeds only intermittently (1 to 2 of 5), and reliable code emerges only with cross-trial memory (Trial-level feedback and MDForge, 5/5 everywhere). Among methods that do produce runnable code, MDForge attains a held-out Kendall $\tau$ of $0.56$ on CB[7] and $0.47$ on CBClip against $0.24$ and $0.20$ for the Trial-level baseline, *more than doubling the ranking signal that transfers from the 4-guest training set to held-out guests*. On CBClip this places MDForge in the performance band of the SAMPL5 BEDAM and SOMD human submissions (Yin et al., [2017](https://arxiv.org/html/2606.12916#bib.bib87)). OAH is an information-limited exception ($n_{\mathrm{held}}=5$, narrow $\Delta G$ window): all methods that produce runnable code cluster at $\tau\approx 0.20$. We use rank-based metrics because MD predictions carry method-specific force-field offsets (e.g., GAFF over-binds cationic CB[7] guests).
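The reason a rank metric tolerates a systematic offset can be checked directly. The sketch below implements Kendall's tau-a (concordant minus discordant pairs over all pairs, assuming no ties) in plain Python; the free-energy values are made up for illustration and are not results from Table 1.

```python
from itertools import combinations
from typing import Sequence


def kendall_tau(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall tau-a: (concordant - discordant) / total pairs; tied pairs count as zero."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least two items")
    score = 0
    for i, j in combinations(range(len(x)), 2):
        product = (x[i] - x[j]) * (y[i] - y[j])
        score += (product > 0) - (product < 0)
    n_pairs = len(x) * (len(x) - 1) // 2
    return score / n_pairs


# Hypothetical predicted vs. experimental binding free energies (kcal/mol).
predicted = [-9.1, -11.4, -7.2, -13.0, -10.0]
experimental = [-8.5, -12.0, -9.8, -14.1, -10.2]
tau = kendall_tau(predicted, experimental)  # 9 concordant, 1 discordant -> 0.8

# A uniform over-binding offset changes every absolute error but no ranking.
over_bound = [p - 2.0 for p in predicted]
tau_shifted = kendall_tau(over_bound, experimental)
```

With only five held-out guests there are ten pairs, so each swapped pair moves tau by 0.2; this granularity is one way to read the remark that OAH is information-limited.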

### 5.3 Diagnosis

Figure [3](https://arxiv.org/html/2606.12916#S5.F3) asks whether MDForge’s CB[7] endpoint comes from the claimed mechanism: that verbal RL can turn PRISM’s typed signals into localized pipeline edits.

Effectiveness (Figure [3](https://arxiv.org/html/2606.12916#S5.F3)a). MDForge climbs monotonically to $\tau=0.56$, while the debate-only ablation (w/o Stage diagnostics) peaks at $0.47$ in trial 3 then collapses to $0.16$. Without a per-stage signal, the agent cannot tell a good edit from a regression and discards a working pipeline. The other two methods stall near $\tau\approx 0.20$. PRISM’s value lies in *keeping* a high $\tau$ once it is reached.

Cost (Figure [3](https://arxiv.org/html/2606.12916#S5.F3)b). All four methods land within \$68 to \$78 at trial 5. MDForge’s early per-trial token overhead is amortized once edits become localized rather than wholesale rewrites, so panel (a)’s ranking gain carries no cost premium.

Failure typing (Figure [3](https://arxiv.org/html/2606.12916#S5.F3)c). PRISM *types* the failures rather than lowering their count. Mid-pipeline crashes shrink from $51\%$ with both components removed to $0\%$ on MDForge, while analysis-stage refusal appears only with stage diagnostics: it is the convergence guard explicitly declining to emit a silent MBAR false-convergence. The $56\%$ clean-success rate of the w/o Stage diagnostics ablation therefore includes outcomes the guard would have refused.

Case study (Figure [3](https://arxiv.org/html/2606.12916#S5.F3)d). One trial follows the chain end to end: the Analysis stage emits `mbar_overlap_above_0p03 = False`; the analysis specialist proposes a guard; the Code agent applies a one-line edit at the named location, not a shotgun rewrite. Panel (c)’s blue segments aggregate this mechanism across trials, making panel (a)’s gain reproducible rather than lucky.
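To make the case study concrete, here is a sketch of what such a guard could look like. Only the flag name and the 0.03 threshold come from the text; the assumption that the flag refers to the smallest overlap between adjacent thermodynamic states, the matrix layout, and all function and exception names are hypothetical. The overlap matrix is taken as given rather than computed by any particular library.

```python
from typing import Dict, List, Tuple

OVERLAP_THRESHOLD = 0.03  # taken from the flag name mbar_overlap_above_0p03


class ConvergenceRefusal(RuntimeError):
    """Raised instead of returning a free energy that may be falsely converged."""


def min_adjacent_overlap(overlap: List[List[float]]) -> float:
    """Smallest overlap between neighbouring states i and i+1 (assumed criterion)."""
    n = len(overlap)
    if n < 2:
        raise ValueError("need at least two thermodynamic states")
    return min(min(overlap[i][i + 1], overlap[i + 1][i]) for i in range(n - 1))


def guarded_free_energy(
    delta_g: float, overlap: List[List[float]]
) -> Tuple[float, Dict[str, bool]]:
    """Return delta_g with its diagnostic flag, or refuse when overlap is too low."""
    flag = min_adjacent_overlap(overlap) > OVERLAP_THRESHOLD
    diagnostic = {"mbar_overlap_above_0p03": flag}
    if not flag:
        raise ConvergenceRefusal(f"refusing to report delta_g: {diagnostic}")
    return delta_g, diagnostic


# Illustrative 3-state overlap matrices (rows sum to 1).
good = [[0.80, 0.20, 0.00], [0.20, 0.60, 0.20], [0.00, 0.20, 0.80]]
bad = [[0.99, 0.01, 0.00], [0.01, 0.79, 0.20], [0.00, 0.20, 0.80]]
value, diag = guarded_free_energy(-17.6, good)
```

A refusal of this kind shows up as a failure in panel (c), yet it is the informative kind: it names the stage and the observable, which is what allows the subsequent edit to stay local.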

![Refer to caption](https://arxiv.org/html/2606.12916v1/x4.png) Figure 4: End\-to\-end discovery from in\-silico screening to wet\-lab confirmation\. \(a\) Ten unseen candidate guests ranked by MDForge\-predicted binding free energy; top\-1 is Bromantane \(Brom\)\. \(b\) Competition $^{1}$H NMR assay: CB\[7\], the picomolar reference guest ferrocenylmethyl\-trimethylammonium \(FMTA; $K_a^{\mathrm{FMTA}}\approx 2\times 10^{12}$ M$^{-1}$ \(Alnajjar et al\., [2021](https://arxiv.org/html/2606.12916#bib.bib107)\)\), and Brom co\-equilibrate\. \(c\) Top: measured $\bar{k}_{\mathrm{rel}}=4.26$ \($n=3$\) yields $K_a^{\mathrm{Brom}}\approx 8\times 10^{12}$ M$^{-1}$ \($\Delta G_{\mathrm{exp}}\approx -17.6$ kcal/mol\)\. Bottom: Brom \(red star\) plotted against published CB\[7\] binders spanning $K_a$ from $10^{5}$ to $10^{17}$ M$^{-1}$ \(Cao et al\., [2014](https://arxiv.org/html/2606.12916#bib.bib108); Rekharsky et al\., [2007](https://arxiv.org/html/2606.12916#bib.bib109); Moghaddam et al\., [2011](https://arxiv.org/html/2606.12916#bib.bib110); Liu et al\., [2005](https://arxiv.org/html/2606.12916#bib.bib111); Mock and Shih, [1986](https://arxiv.org/html/2606.12916#bib.bib112)\)\.

### 5\.4 Does MDForge Build Like a Human Expert?

Beyond agent\-vs\-agent comparison, the chemistry\-credibility check is whether MDForge’s *pipeline* is the kind of pipeline a human MD expert would actually design\. As our expert reference we use the canonical pAPRika APR pipeline from the Gilson lab \(Henriksen et al\., [2015](https://arxiv.org/html/2606.12916#bib.bib113); Slochower et al\., [2019](https://arxiv.org/html/2606.12916#bib.bib114)\), the open\-source reference implementation that has served as the de facto standard for host–guest free\-energy calculations on cucurbit\[n\]uril and related systems for nearly a decade\. Table [2](https://arxiv.org/html/2606.12916#S5.T2) contrasts per\-stage tool selection on SAMPL4 CB\[7\] across three pipelines: this expert reference; an AI \+ non\-expert pipeline that a chemistry non\-expert assembled with LLM coding assistance; and MDForge running autonomously\. The two AI\-touched columns split in opposite directions: MDForge *stays in the expert’s methodological family* \(GAFF2 \+ AM1\-BCC, APR umbrella, MBAR\), departing only on reliability\-flavored engineering \(italicized in the table\)\. The AI \+ non\-expert pipeline, in contrast, *diverges to an alternative method family* \(z\-PMF umbrella rather than APR with dummy\-atom geometry\), a defensible but methodologically distinct route that LLM coding assistance plausibly leads a non\-expert toward\.

The *Performance* row sharpens the comparison along the same gradient\. MDForge attains $\rho=0.68$, $\tau=0.56$, $R^{2}=0.58$ on the SAMPL4 CB\[7\] guests, *recovering 78–82% of the expert’s ranking utility* \($\rho=0.83$, $\tau=0.68$, $R^{2}=0.74$\) without a human in the loop, and *beating the AI \+ non\-expert pipeline on all three correlation metrics* \($\rho=0.61$, $\tau=0.47$, $R^{2}=0.44$\)\. Absolute\-error metrics are dominated by force\-field\-specific systematic biases \(e\.g\., the well\-known GAFF CB\[7\]–cation over\-binding\) and therefore do not offer a like\-for\-like cross\-pipeline comparison\.
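
The 78–82% range follows from dividing each MDForge metric by the corresponding expert value\. The short check below uses only the numbers quoted in this paragraph; the dictionary layout and variable names are ours, introduced for illustration\.

```python
# Correlation metrics on the SAMPL4 CB[7] guests, as quoted in the text.
expert = {"rho": 0.83, "tau": 0.68, "r2": 0.74}
mdforge = {"rho": 0.68, "tau": 0.56, "r2": 0.58}
non_expert = {"rho": 0.61, "tau": 0.47, "r2": 0.44}

# Fraction of the expert's value that MDForge recovers, per metric.
recovery = {name: mdforge[name] / expert[name] for name in expert}

lowest = min(recovery.values())
highest = max(recovery.values())

# MDForge exceeds the AI + non-expert pipeline on every metric.
beats_non_expert = all(mdforge[name] > non_expert[name] for name in expert)

for name, fraction in recovery.items():
    print(f"{name}: {fraction:.1%}")
```

The lowest ratio comes from $R^{2}$ and the highest from $\tau$, so the quoted range spans all three metrics rather than the rank correlations alone\.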

### 5\.5 Prospective Wet\-Lab Validation

Retrospective benchmark accuracy is necessary but not sufficient: a useful pipeline must also *deploy* prospectively\. We applied the best MDForge CB\[7\] pipeline to ten unseen candidate guests drawn from the compound bank we extracted from ChEMBL \(Zdrazil et al\., [2024](https://arxiv.org/html/2606.12916#bib.bib115)\) and DrugBank \(Knox et al\., [2024](https://arxiv.org/html/2606.12916#bib.bib116)\), ranked them by predicted $\Delta\hat{G}$, and submitted the top\-1 \(Bromantane, "Brom"; Figure [4](https://arxiv.org/html/2606.12916#S5.F4)a\) for wet\-lab measurement\. Because $K_a$ values in the picomolar regime exceed what direct isothermal titration calorimetry can resolve, we used competition $^{1}$H NMR against a reference guest of known affinity \(Figure [4](https://arxiv.org/html/2606.12916#S5.F4)b\)\. Co\-equilibrating Brom with CB\[7\] and the canonical picomolar reference ferrocenylmethyl\-trimethylammonium \(FMTA; $K_a^{\mathrm{FMTA}}\approx 2\times 10^{12}$ M$^{-1}$ \(Alnajjar et al\., [2021](https://arxiv.org/html/2606.12916#bib.bib107)\)\) establishes the exchange $\mathrm{CB}\!\cdot\!\mathrm{FMTA}+\mathrm{Brom}\rightleftharpoons\mathrm{CB}\!\cdot\!\mathrm{Brom}+\mathrm{FMTA}$, whose relative constant $K_{\mathrm{rel}}=K_a^{\mathrm{Brom}}/K_a^{\mathrm{FMTA}}$ is read directly from the bound and unbound NMR integrations of both guests \(the free\-host concentration cancels\)\. Averaging across three independent mixtures prepared at different guest molar ratios yields $K_{\mathrm{rel}}=4.26$ \($n=3$\), hence $K_a^{\mathrm{Brom}}\approx 8\times 10^{12}$ M$^{-1}$ \($\Delta G_{\mathrm{exp}}\approx -17.6$ kcal/mol; Figure [4](https://arxiv.org/html/2606.12916#S5.F4)c, top\), approximately four\-fold tighter than the FMTA reference\. The landscape \(Figure [4](https://arxiv.org/html/2606.12916#S5.F4)c, bottom\) places Brom in the picomolar high\-affinity tier of published CB\[7\] binders, comparable to deliberately engineered ferrocene and adamantane di\-ammonium guests, while remaining $\sim$5 orders of magnitude below the current record holder \(diamantane\-bis\(ammonium\)\) \(Cao et al\., [2014](https://arxiv.org/html/2606.12916#bib.bib108)\); all entries above Brom on this scale were obtained through years of human\-driven design\.
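
The arithmetic behind these numbers can be reproduced in a few lines\. The sketch below is a minimal illustration, not the authors’ analysis script: the function names are ours, the temperature of 298\.15 K and the 1 M standard state are assumptions \(neither is stated in this section\), and the integrals passed to `k_rel_from_integrals` are assumed to be normalized per proton\.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol K)


def k_rel_from_integrals(brom_bound, brom_free, fmta_bound, fmta_free):
    """K_rel = ([CB.Brom]/[Brom]) / ([CB.FMTA]/[FMTA]); the free-host term cancels."""
    return (brom_bound / brom_free) / (fmta_bound / fmta_free)


def binding_free_energy(ka, temperature=298.15):
    """Standard binding free energy in kcal/mol from Ka in M^-1 (1 M standard state)."""
    return -R_KCAL * temperature * math.log(ka)


# Values quoted in the text.
ka_fmta = 2e12   # M^-1, FMTA reference affinity
k_rel = 4.26     # mean relative constant over three mixtures

ka_brom = k_rel * ka_fmta
dg_brom = binding_free_energy(ka_brom)

print(f"Ka(Brom) = {ka_brom:.2e} M^-1")
print(f"dG(Brom) = {dg_brom:.1f} kcal/mol")
```

The unrounded product is $8.52\times 10^{12}$ M$^{-1}$, which the text reports as $\approx 8\times 10^{12}$ M$^{-1}$; at the assumed temperature this corresponds to a free energy consistent with the quoted $-17.6$ kcal/mol\.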

We tested only the top\-1 candidate, not the other nine: this single wet\-lab measurement is intended to demonstrate that MDForge can translate in\-silico design into a real prospective scientific discovery, rather than to make the discovered molecule itself the primary contribution of this work\. Furthermore, no claims are made regarding the expected binding affinity or validation of the full rankings across all ten hits identified from in\-silico screening\.

## 6 Conclusion

We introduced MDForge, an LLM agent that designs molecular dynamics pipelines as open\-ended code under verbal RL\. Its sparse terminal reward is densified by PRISM into per\-stage diagnostics and a typed, subsystem\-attributable expert critique\. On SAMPL host–guest benchmarks, MDForge outperforms the other LLM\-agent designs in held\-out ranking accuracy on CB\[7\] and CBClip\. Its per\-stage tool choices track those of the human\-expert reference pipeline\. Deployed prospectively on an unseen compound library, MDForge discovered a novel CB\[7\] binder\. Wet\-lab competition NMR against the FMTA picomolar reference measures $K_a\approx 8\times 10^{12}$ M$^{-1}$, placing it in the high\-affinity \(picomolar\) regime\.

## Limitations

Benchmark scope\. We benchmark MDForge on three host–guest systems from the SAMPL series \(CB\[7\], OAH, CBClip\), a restricted slice of the broader MD design space\. The temporal\-staging and subsystem\-expert decompositions are in principle agnostic to the target system class, but the absence of equally mature open benchmarks for protein–ligand affinity, membrane protein insertion, and other binding regimes bounds our empirical reach\. Future work can apply the same framework to broader binding benchmarks \(e\.g\., FEP\+, PDBbind\) and to non\-binding MD applications such as conformational free energy surfaces\.

MD fidelity\. We operate in a cheap\-MD configuration \($\approx$2 GPU\-hours per guest\) to keep the per\-task budget tractable for systematic evaluation\. The framework is in principle compatible with longer production sampling, higher\-accuracy force fields, and explicit polarization\. We expect the comparative method ranking in Table [1](https://arxiv.org/html/2606.12916#S4.T1) to persist across configurations, since the limiting factor in our regime is the verbal RL update rather than the underlying MD fidelity\. Verifying this at higher fidelity is left to future work\.

Scope of the wet\-lab demonstration\. The prospective wet\-lab measurement \(§[5\.5](https://arxiv.org/html/2606.12916#S5.SS5)\) covers exactly one data point at the top of the predicted ranking\. We do not validate the full ten\-compound ranking, nor do we confirm binding affinity for any of the other predicted binders\. Those measurements are out of scope here\. What we do validate is that the framework retrieves a real high\-affinity CB\[7\] binder at the top of an unseen library, end\-to\-end\. We treat this as the right scope for a first prospective demonstration; broader prospective screens against multiple hosts are deferred to follow\-up work\.

## Ethical Considerations

Dual\-use risk\. MDForge automates the design of MD pipelines for binding affinity prediction\. The same capability that accelerates therapeutic discovery could in principle be repurposed to design harmful binders\. Our prospective wet\-lab result sharpens this concern from a theoretical possibility to an operational one\. The framework’s direct output is a simulation protocol \(Python code\), not a molecule\. The locus of dual\-use control therefore sits upstream \(in the candidate library and the choice of target\) and downstream \(in the interpretation and deployment\), not at MDForge itself\. Responsible deployment requires governance over those layers, consistent with established norms in computational chemistry and structure\-based drug discovery \(Boiko et al\., [2023](https://arxiv.org/html/2606.12916#bib.bib37); Bran et al\., [2024](https://arxiv.org/html/2606.12916#bib.bib52)\)\.

Scope of the wet\-lab demonstration\. The compound bank from which the top\-1 candidate was drawn was constructed in separate work and is not a curated set of pharmacologically active species\. CB\[7\] is a synthetic macrocyclic host, not a biological target\. The discovered binder therefore has no direct therapeutic use, and the experiment should not be read as a drug\-discovery claim\. The bank composition is governed by the originating project; the curated compound bank will be released in a follow\-on publication\.

Computational footprint\. End\-to\-end MD\-based virtual screening is energy\-intensive\. We deliberately operate in a cheap\-MD configuration \($\approx$2 GPU\-hours per guest\) and a small trial budget, both of which lower the per\-discovery energy cost relative to traditional expert\-design iteration\. Useful community directions for further reducing this footprint include energy\-aware scheduling, surrogate filters that defer expensive sampling, and shared baselines that quantify energy\-per\-prediction relative to human\-designed pipelines\.

Reproducibility and auditability\. AI\-designed MD pipelines must be auditable by domain experts before being treated as scientific instruments\. MDForge produces Python code as its action rather than opaque numerical parameters, so individual pipelines are inspectable\. The generated code, however, is often long and stylistically idiosyncratic, which raises the audit burden\. We release the agent code, per\-trial pipeline artifacts, multi\-expert prompts, and the full LLM configuration\. Promising future directions include pipeline\-summarization tools that compress agent\-generated code into expert\-readable protocol descriptions, and automated provenance tracking that links each design choice back to the typed critique that motivated it\.

## References

- GROMACS: high performance molecular simulations through multi\-level parallelism from laptops to supercomputers\.SoftwareX\.Cited by:[§A\.1](https://arxiv.org/html/2606.12916#A1.SS1.p1.1),[§2](https://arxiv.org/html/2606.12916#S2.p1.1)\.
- M\. A\. Alnajjar, W\. M\. Nau, and A\. Hennig \(2021\)A reference scale of cucurbit\[7\]uril binding affinities\.Organic & Biomolecular Chemistry\.Cited by:[Figure 4](https://arxiv.org/html/2606.12916#S5.F4),[§5\.5](https://arxiv.org/html/2606.12916#S5.SS5.p1.13)\.
- M\. Andrychowicz, F\. Wolski, A\. Ray, J\. Schneider, R\. Fong, P\. Welinder, B\. McGrew, J\. Tobin, P\. Abbeel, and W\. Zaremba \(2017\)Hindsight experience replay\.InAdvances in Neural Information Processing Systems,Cited by:[§A\.4](https://arxiv.org/html/2606.12916#A1.SS4.p1.4),[§A\.5](https://arxiv.org/html/2606.12916#A1.SS5.p1.1)\.
- J\. A\. Arjona\-Medina, M\. Gillhofer, M\. Widrich, T\. Unterthiner, J\. Brandstetter, and S\. Hochreiter \(2019\)RUDDER: return decomposition for delayed rewards\.InAdvances in Neural Information Processing Systems,Cited by:[§A\.5](https://arxiv.org/html/2606.12916#A1.SS5.p1.1)\.
- J\. Ba, G\. E\. Hinton, V\. Mnih, J\. Z\. Leibo, and C\. Ionescu \(2016\)Using fast weights to attend to the recent past\.InAdvances in Neural Information Processing Systems,Cited by:[§4\.1](https://arxiv.org/html/2606.12916#S4.SS1.p1.2)\.
- J\. Baek, S\. K\. Jauhar, S\. Cucerzan, and S\. J\. Hwang \(2025\)ResearchAgent: iterative research idea generation over scientific literature with large language models\.InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 1: Long Papers\),Cited by:[§A\.2](https://arxiv.org/html/2606.12916#A1.SS2.p1.1)\.
- Y\. Bai, S\. Kadavath, S\. Kundu, A\. Askell, J\. Kernion, A\. Jones, A\. Chen, A\. Goldie, A\. Mirhoseini, C\. McKinnon,et al\.\(2022\)Constitutional AI: harmlessness from AI feedback\.arXiv preprint arXiv:2212\.08073\.Cited by:[§A\.6](https://arxiv.org/html/2606.12916#A1.SS6.p1.1)\.
- J\. Behler \(2021\)Four generations of high\-dimensional neural network potentials\.Chemical Reviews\.Cited by:[§A\.1](https://arxiv.org/html/2606.12916#A1.SS1.p1.1),[§1](https://arxiv.org/html/2606.12916#S1.p1.1),[§2](https://arxiv.org/html/2606.12916#S2.p1.1)\.
- A\. Bender and I\. Cortés\-Ciriano \(2021\)Artificial intelligence in drug discovery: what is realistic, what are illusions? Part 1: ways to make an impact, and why we are not there yet\.Drug Discovery Today\.Cited by:[§1](https://arxiv.org/html/2606.12916#S1.p2.1)\.
- C\. H\. Bennett \(1976\)Efficient estimation of free energy differences from Monte Carlo data\.Journal of Computational Physics\.Cited by:[§A\.1](https://arxiv.org/html/2606.12916#A1.SS1.p1.1),[§2](https://arxiv.org/html/2606.12916#S2.p1.1)\.
- D\. A\. Boiko, R\. MacKnight, B\. Kline, and G\. Gomes \(2023\)Autonomous chemical research with large language models\.Nature\.Cited by:[§A\.2](https://arxiv.org/html/2606.12916#A1.SS2.p1.1),[§2](https://arxiv.org/html/2606.12916#S2.p2.1),[Ethical Considerations](https://arxiv.org/html/2606.12916#Sx2.p1.1)\.
- S\. Boresch, F\. Tettinger, M\. Leitgeb, and M\. Karplus \(2003\)Absolute binding free energies: a quantitative approach for their calculation\.The Journal of Physical Chemistry B\.Cited by:[§A\.1](https://arxiv.org/html/2606.12916#A1.SS1.p1.1)\.
- S\. Bottaro and K\. Lindorff\-Larsen \(2018\)Biophysical experiments and biomolecular simulations: a perfect match?\.Science\.Cited by:[§1](https://arxiv.org/html/2606.12916#S1.p2.1)\.
- A\. M\. Bran, S\. Cox, O\. Schilter, C\. Baldassari, A\. D\. White, and P\. Schwaller \(2024\)Augmenting large language models with chemistry tools\.Nature Machine Intelligence\.Cited by:[§A\.2](https://arxiv.org/html/2606.12916#A1.SS2.p1.1),[§2](https://arxiv.org/html/2606.12916#S2.p2.1),[Ethical Considerations](https://arxiv.org/html/2606.12916#Sx2.p1.1)\.
- T\. B\. Brown, B\. Mann, N\. Ryder, M\. Subbiah, J\. Kaplan, P\. Dhariwal, A\. Neelakantan, P\. Shyam, G\. Sastry, A\. Askell, S\. Agarwal, A\. Herbert\-Voss, G\. Krueger, T\. Henighan, R\. Child, A\. Ramesh, D\. M\. Ziegler, J\. Wu, C\. Winter, C\. Hesse, M\. Chen, E\. Sigler, M\. Litwin, S\. Gray, B\. Chess, J\. Clark, C\. Berner, S\. McCandlish, A\. Radford, I\. Sutskever, and D\. Amodei \(2020\)Language models are few\-shot learners\.InAdvances in Neural Information Processing Systems \(NeurIPS\),Cited by:[§A\.4](https://arxiv.org/html/2606.12916#A1.SS4.p1.4)\.
- Y\. Burda, H\. Edwards, A\. Storkey, and O\. Klimov \(2019\)Exploration by random network distillation\.InInternational Conference on Learning Representations,Cited by:[§A\.5](https://arxiv.org/html/2606.12916#A1.SS5.p1.1)\.
- Q\. Campbell, S\. Cox, J\. Medina, B\. Watterson, and A\. D\. White \(2025\)MDCrow: automating molecular dynamics workflows with large language models\.arXiv preprint arXiv:2502\.09565\.Cited by:[§A\.2](https://arxiv.org/html/2606.12916#A1.SS2.p1.1),[§1](https://arxiv.org/html/2606.12916#S1.p3.1),[§2](https://arxiv.org/html/2606.12916#S2.p2.1)\.
- L\. Cao, M\. Šekutor, P\. Y\. Zavalij, K\. Mlinarić\-Majerski, R\. Glaser, and L\. Isaacs \(2014\)Cucurbit\[7\]uril·guest pair with an attomolar dissociation constant\.Angewandte Chemie International Edition\.Cited by:[Figure 4](https://arxiv.org/html/2606.12916#S5.F4),[§5\.5](https://arxiv.org/html/2606.12916#S5.SS5.p1.13)\.
- D\. A\. Case, H\. M\. Aktulga, K\. Belfon, D\. S\. Cerutti, G\. A\. Cisneros, V\. W\. D\. Cruzeiro, N\. Forouzesh, T\. J\. Giese, A\. W\. Goetz, H\. Gohlke,et al\.\(2023\)AmberTools\.Journal of Chemical Information and Modeling\.Cited by:[§A\.1](https://arxiv.org/html/2606.12916#A1.SS1.p1.1),[§2](https://arxiv.org/html/2606.12916#S2.p1.1)\.
- J\. S\. Chan, N\. Chowdhury, O\. Jaffe, J\. Aung, D\. Sherburn, E\. Mays, G\. Starace, K\. Liu, L\. Maksin, T\. Patwardhan, L\. Weng, and A\. Mądry \(2025\)MLE\-bench: evaluating machine learning agents on machine learning engineering\.InInternational Conference on Learning Representations,Cited by:[§A\.2](https://arxiv.org/html/2606.12916#A1.SS2.p1.1)\.
- A\. Chandrasekhar and A\. B\. Farimani \(2025\)Automating MD simulations for proteins using large language models: NAMD\-agent\.arXiv preprint arXiv:2507\.07887\.Cited by:[§A\.2](https://arxiv.org/html/2606.12916#A1.SS2.p1.1),[§2](https://arxiv.org/html/2606.12916#S2.p2.1)\.
- G\. Chen, J\. Chen, L\. Chen, J\. Zhao, F\. Meng, W\. X\. Zhao, R\. Song, C\. Chen, J\. Wen, and K\. Jia \(2026\)Toward autonomous long\-horizon engineering for ML research\.arXiv preprint arXiv:2604\.13018\.Cited by:[§A\.2](https://arxiv.org/html/2606.12916#A1.SS2.p1.1),[§1](https://arxiv.org/html/2606.12916#S1.p4.1)\.
- L\. Chen, K\. Lu, A\. Rajeswaran, K\. Lee, A\. Grover, M\. Laskin, P\. Abbeel, A\. Srinivas, and I\. Mordatch \(2021\)Decision transformer: reinforcement learning via sequence modeling\.InAdvances in Neural Information Processing Systems,Cited by:[§A\.3](https://arxiv.org/html/2606.12916#A1.SS3.p1.1)\.
- M\. Chen, Y\. Li, Y\. Yang, S\. Yu, B\. Lin, and X\. He \(2024\)AutoManual: constructing instruction manuals by LLM agents via interactive environmental learning\.InAdvances in Neural Information Processing Systems,Cited by:[§A\.3](https://arxiv.org/html/2606.12916#A1.SS3.p1.1),[§1](https://arxiv.org/html/2606.12916#S1.p3.1)\.
- P\. F\. Christiano, J\. Leike, T\. B\. Brown, M\. Martic, S\. Legg, and D\. Amodei \(2017\)Deep reinforcement learning from human preferences\.InAdvances in Neural Information Processing Systems,Cited by:[§A\.5](https://arxiv.org/html/2606.12916#A1.SS5.p1.1)\.
- K\. Cobbe, V\. Kosaraju, M\. Bavarian, M\. Chen, H\. Jun, L\. Kaiser, M\. Plappert, J\. Tworek, J\. Hilton, R\. Nakano, C\. Hesse, and J\. Schulman \(2021\)Training verifiers to solve math word problems\.arXiv preprint arXiv:2110\.14168\.Cited by:[§A\.5](https://arxiv.org/html/2606.12916#A1.SS5.p1.1)\.
- Z\. Cournia, B\. Allen, and W\. Sherman \(2017\)Relative binding free energy calculations in drug discovery: recent advances and practical considerations\.Journal of chemical information and modeling\.Cited by:[§A\.1](https://arxiv.org/html/2606.12916#A1.SS1.p1.1),[§1](https://arxiv.org/html/2606.12916#S1.p3.1)\.
- DeepSeek\-AI \(2025\)DeepSeek\-R1 incentivizes reasoning in LLMs through reinforcement learning\.Nature\.Cited by:[§A\.5](https://arxiv.org/html/2606.12916#A1.SS5.p1.1)\.
- L\. Ding, J\. Carrillo, and C\. Do \(2025\)ToPolyAgent: AI agents for coarse\-grained topological polymer simulations\.arXiv preprint arXiv:2510\.12091\.Cited by:[§A\.2](https://arxiv.org/html/2606.12916#A1.SS2.p1.1)\.
- Y\. Du, S\. Li, A\. Torralba, J\. B\. Tenenbaum, and I\. Mordatch \(2024\)Improving factuality and reasoning in language models through multiagent debate\.InInternational Conference on Machine Learning,Cited by:[§A\.3](https://arxiv.org/html/2606.12916#A1.SS3.p1.1),[§A\.6](https://arxiv.org/html/2606.12916#A1.SS6.p1.1),[§1](https://arxiv.org/html/2606.12916#S1.p5.2),[§4\.2](https://arxiv.org/html/2606.12916#S4.SS2.p3.6)\.
- P\. Eastman, J\. Swails, J\. D\. Chodera, R\. T\. McGibbon, Y\. Zhao, K\. A\. Beauchamp, L\. Wang, A\. C\. Simmonett, M\. P\. Harrigan, C\. D\. Stern, R\. P\. Wiewiora, B\. R\. Brooks, and V\. S\. Pande \(2017\)OpenMM 7: rapid development of high performance algorithms for molecular dynamics\.PLOS Computational Biology\.Cited by:[§A\.1](https://arxiv.org/html/2606.12916#A1.SS1.p1.1),[§2](https://arxiv.org/html/2606.12916#S2.p1.1)\.
- A\. Ghafarollahi and M\. J\. Buehler \(2025\)Automating alloy design and discovery with physics\-aware multimodal multiagent AI\.Proceedings of the National Academy of Sciences\.Cited by:[§A\.2](https://arxiv.org/html/2606.12916#A1.SS2.p1.1)\.
- M\. K\. Gilson, J\. A\. Given, B\. L\. Bush, and J\. A\. McCammon \(1997\)The statistical\-thermodynamic basis for computation of binding affinities: a critical review\.Biophysical Journal\.Cited by:[§A\.1](https://arxiv.org/html/2606.12916#A1.SS1.p1.1)\.
- S\. Guilbert, C\. Masschelein, J\. Goumaz, B\. Naida, and P\. Schwaller \(2025\)DynaMate: an autonomous agent for protein\-ligand molecular dynamics simulations\.arXiv preprint arXiv:2512\.10034\.Cited by:[§A\.2](https://arxiv.org/html/2606.12916#A1.SS2.p1.1),[§1](https://arxiv.org/html/2606.12916#S1.p3.1),[§2](https://arxiv.org/html/2606.12916#S2.p2.1)\.
- R\. Gupta, J\. Hartford, and B\. Liu \(2025\)LLMs for Bayesian optimization in scientific domains: are we there yet?\.InFindings of the Association for Computational Linguistics: EMNLP 2025,Cited by:[§1](https://arxiv.org/html/2606.12916#S1.p4.1)\.
- L\. Hedges, A\. S\. J\. S\. Mey, C\. A\. Laughton, F\. L\. Gervasio, A\. J\. Mulholland, C\. J\. Woods, and J\. Michel \(2019\)BioSimSpace: an interoperable Python framework for biomolecular simulation\.Journal of Open Source Software\.Cited by:[§A\.1](https://arxiv.org/html/2606.12916#A1.SS1.p1.1)\.
- N\. M\. Henriksen, A\. T\. Fenley, and M\. K\. Gilson \(2015\)Computational calorimetry: high\-precision calculation of host\-guest binding thermodynamics\.Journal of Chemical Theory and Computation\.Cited by:[§5\.4](https://arxiv.org/html/2606.12916#S5.SS4.p1.1)\.
- S\. A\. Hollingsworth and R\. O\. Dror \(2018\)Molecular dynamics simulation for all\.Neuron\.Cited by:[§1](https://arxiv.org/html/2606.12916#S1.p1.1)\.
- J\. Jumper, R\. Evans, A\. Pritzel, T\. Green, M\. Figurnov, O\. Ronneberger, K\. Tunyasuvunakool, R\. Bates, A\. Žídek, A\. Potapenko, A\. Bridgland, C\. Meyer, S\. A\. A\. Kohl, A\. J\. Ballard, A\. Cowie, B\. Romera\-Paredes, S\. Nikolov, R\. Jain, J\. Adler, T\. Back, S\. Petersen, D\. Reiman, E\. Clancy, M\. Zielinski, M\. Steinegger, M\. Pacholska, T\. Berghammer, S\. Bodenstein, D\. Silver, O\. Vinyals, A\. W\. Senior, K\. Kavukcuoglu, P\. Kohli, and D\. Hassabis \(2021\)Highly accurate protein structure prediction with AlphaFold\.Nature\.Cited by:[§A\.1](https://arxiv.org/html/2606.12916#A1.SS1.p1.1),[§2](https://arxiv.org/html/2606.12916#S2.p1.1)\.
- M\. Karplus and J\. A\. McCammon \(2002\)Molecular dynamics simulations of biomolecules\.Nature Structural Biology\.Cited by:[§1](https://arxiv.org/html/2606.12916#S1.p1.1)\.
- M\. Khalifa, R\. Agarwal, L\. Logeswaran, J\. Kim, H\. Peng, M\. Lee, H\. Lee, and L\. Wang \(2025\)Process reward models that think\.arXiv preprint arXiv:2504\.16828\.Cited by:[§A\.5](https://arxiv.org/html/2606.12916#A1.SS5.p1.1)\.
- C\. Knox, M\. Wilson, C\. M\. Klinger, M\. Franklin, E\. Oler, A\. Wilson, A\. Pon, J\. Cox, N\. E\. Chin, S\. A\. Strawbridge, M\. Garcia\-Patino, R\. Kruger, A\. Sivakumaran, S\. Sanford, R\. Doshi, N\. Khetarpal, O\. Fatokun, D\. Doucet, A\. Zubkowski, D\. Y\. Rayat, H\. Jackson, K\. Harford, A\. Anjum, M\. Zakir, F\. Wang, S\. Tian, B\. Lee, J\. Liigand, H\. Peters, R\. Q\. Wang, T\. Nguyen, D\. So, M\. Sharp, R\. da Silva, C\. Gabriel, J\. Scantlebury, M\. Jasinski, D\. Ackerman, T\. Jewison, T\. Sajed, V\. Gautam, and D\. S\. Wishart \(2024\)DrugBank 6\.0: the DrugBank knowledgebase for 2024\.Nucleic Acids Research\.Cited by:[§5\.5](https://arxiv.org/html/2606.12916#S5.SS5.p1.13)\.
- M\. Laskin, L\. Wang, J\. Oh, E\. Parisotto, S\. Spencer, R\. Steigerwald, D\. Strouse, S\. Hansen, A\. Filos, E\. Brooks, M\. Gazeau, H\. Sahni, S\. Singh, and V\. Mnih \(2023\)In\-context reinforcement learning with algorithm distillation\.InInternational Conference on Learning Representations,Cited by:[§A\.3](https://arxiv.org/html/2606.12916#A1.SS3.p1.1)\.
- H\. Lee, S\. Phatale, H\. Mansoor, T\. Mesnard, J\. Ferret, K\. R\. Lu, C\. Bishop, E\. Hall, V\. Carbune, A\. Rastogi, and S\. Prakash \(2024\)RLAIF vs\. RLHF: scaling reinforcement learning from human feedback with AI feedback\.InProceedings of the 41st International Conference on Machine Learning \(ICML\),Cited by:[§A\.6](https://arxiv.org/html/2606.12916#A1.SS6.p1.1)\.
- G\. Li, H\. A\. A\. K\. Hammoud, H\. Itani, D\. Khizbullin, and B\. Ghanem \(2023\)CAMEL: communicative agents for “mind” exploration of large language model society\.InAdvances in Neural Information Processing Systems \(NeurIPS\),Cited by:[§A\.6](https://arxiv.org/html/2606.12916#A1.SS6.p1.1)\.
- T\. Liang, Z\. He, W\. Jiao, X\. Wang, Y\. Wang, R\. Wang, Y\. Yang, Z\. Tu, and S\. Shi \(2024\)Encouraging divergent thinking in large language models through multi\-agent debate\.InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing \(EMNLP\),Cited by:[§A\.6](https://arxiv.org/html/2606.12916#A1.SS6.p1.1)\.
- H\. Lightman, V\. Kosaraju, Y\. Burda, H\. Edwards, B\. Baker, T\. Lee, J\. Leike, J\. Schulman, I\. Sutskever, and K\. Cobbe \(2024\)Let’s verify step by step\.InInternational Conference on Learning Representations,Cited by:[§A\.5](https://arxiv.org/html/2606.12916#A1.SS5.p1.1),[§1](https://arxiv.org/html/2606.12916#S1.p5.2),[§4\.1](https://arxiv.org/html/2606.12916#S4.SS1.p1.2)\.
- S\. Liu, C\. Ruspic, P\. Mukhopadhyay, S\. Chakrabarti, P\. Y\. Zavalij, and L\. Isaacs \(2005\)The cucurbit\[n\]uril family: prime components for self\-sorting systems\.Journal of the American Chemical Society\.Cited by:[Figure 4](https://arxiv.org/html/2606.12916#S5.F4)\.
- Y\. Liu, C\. Si, K\. Narasimhan, and S\. Yao \(2025\)Contextual experience replay for self\-improvement of language agents\.InAnnual Meeting of the Association for Computational Linguistics,Cited by:[§A\.3](https://arxiv.org/html/2606.12916#A1.SS3.p1.1)\.
- C\. Lu, C\. Lu, R\. T\. Lange, J\. Foerster, J\. Clune, and D\. Ha \(2024\)The AI scientist: towards fully automated open\-ended scientific discovery\.arXiv preprint arXiv:2408\.06292\.Cited by:[§A\.2](https://arxiv.org/html/2606.12916#A1.SS2.p1.1),[§2](https://arxiv.org/html/2606.12916#S2.p2.1)\.
- L\. Luo, Y\. Liu, R\. Liu, S\. Phatale, M\. Guo, H\. Lara, Y\. Li, L\. Shu, Y\. Zhu, L\. Meng, J\. Sun, and A\. Rastogi \(2024\)Improve mathematical reasoning in language models by automated process supervision\.arXiv preprint arXiv:2406\.06592\.Cited by:[§A\.5](https://arxiv.org/html/2606.12916#A1.SS5.p1.1)\.
- R\. Luo, Z\. Liu, X\. Liu, C\. Du, M\. Lin, W\. Chen, W\. Lu, and T\. Pang \(2025\)Language models can learn from verbal feedback without scalar rewards\.arXiv preprint arXiv:2509\.22638\.Cited by:[§A\.3](https://arxiv.org/html/2606.12916#A1.SS3.p1.1)\.
- T\. Ma, Y\. Qian, Z\. Zhang, Z\. Wang, X\. Qian, F\. Bai, Y\. Ding, X\. Luo, S\. Zhang, K\. Murugesan, C\. Zhang, and Y\. Ye \(2025\)AutoData: a multi\-agent system for open web data collection\.InAdvances in Neural Information Processing Systems \(NeurIPS\),Cited by:[§A\.6](https://arxiv.org/html/2606.12916#A1.SS6.p1.1)\.
- Y\. J\. Ma, W\. Liang, G\. Wang, D\. Huang, O\. Bastani, D\. Jayaraman, Y\. Zhu, L\. Fan, and A\. Anandkumar \(2024\)Eureka: human\-level reward design via coding large language models\.InInternational Conference on Learning Representations,Cited by:[§A\.3](https://arxiv.org/html/2606.12916#A1.SS3.p1.1)\.
- Y\. Ma, Z\. Wang, W\. Sun, Z\. Zhang, K\. Shi, N\. Chawla, and Y\. Ye \(2026a\)Policy4OOD: a knowledge\-guided world model for policy intervention simulation against the opioid overdose crisis\.arXiv preprint arXiv:2602\.12373\.Cited by:[§A\.2](https://arxiv.org/html/2606.12916#A1.SS2.p1.1)\.
- Z\. Ma, C\. Yang, Y\. Song, J\. Zhu, L\. Yang, L\. Xu, M\. Xiao, and X\. Jiang \(2026b\)MDAgent: a multi\-agent framework for end\-to\-end molecular dynamics research\.arXiv preprint arXiv:2604\.18622\.Cited by:[§A\.2](https://arxiv.org/html/2606.12916#A1.SS2.p1.1),[§1](https://arxiv.org/html/2606.12916#S1.p3.1),[§2](https://arxiv.org/html/2606.12916#S2.p2.1)\.
- A\. Madaan, N\. Tandon, P\. Gupta, S\. Hallinan, L\. Gao, S\. Wiegreffe, U\. Alon, N\. Dziri, S\. Prabhumoye, Y\. Yang, S\. Gupta, B\. P\. Majumder, K\. Hermann, S\. Welleck, A\. Yazdanbakhsh, and P\. Clark \(2023\)Self\-refine: iterative refinement with self\-feedback\.InAdvances in Neural Information Processing Systems,Cited by:[§A\.3](https://arxiv.org/html/2606.12916#A1.SS3.p1.1),[Table 1](https://arxiv.org/html/2606.12916#S4.T1.15.15.2),[Table 1](https://arxiv.org/html/2606.12916#S4.T1.23.23.2),[Table 1](https://arxiv.org/html/2606.12916#S4.T1.8.8.2),[§5\.1](https://arxiv.org/html/2606.12916#S5.SS1.p2.1)\.
- A\. Merchant, S\. Batzner, S\. S\. Schoenholz, M\. Aykol, G\. Cheon, and E\. D\. Cubuk \(2023a\)Scaling deep learning for materials discovery\.Nature\.Cited by:[§A\.2](https://arxiv.org/html/2606.12916#A1.SS2.p1.1),[§2](https://arxiv.org/html/2606.12916#S2.p2.1)\.
- A\. Merchant, S\. Batzner, S\. S\. Schoenholz, M\. Aykol, G\. Cheon, and E\. D\. Cubuk \(2023b\)Scaling deep learning for materials discovery\.Nature\.Cited by:[§1](https://arxiv.org/html/2606.12916#S1.p2.1)\.
- A\. S\. J\. S\. Mey, B\. K\. Allen, H\. E\. Bruce McDonald, J\. D\. Chodera, D\. F\. Hahn, M\. Kuhn, J\. Michel, D\. L\. Mobley, L\. N\. Naden, S\. Prasad, A\. Rizzi, J\. Scheen, M\. R\. Shirts, G\. Tresadern, and H\. Xu \(2020\)Best practices for alchemical free energy calculations \[article v1\.0\]\.Living Journal of Computational Molecular Science\.Cited by:[§A\.1](https://arxiv.org/html/2606.12916#A1.SS1.p1.1),[§1](https://arxiv.org/html/2606.12916#S1.p2.1),[§2](https://arxiv.org/html/2606.12916#S2.p1.1)\.
- D\. L\. Mobley and M\. K\. Gilson \(2017\)Predicting binding free energies: frontiers and benchmarks\.Annual Review of Biophysics\.Cited by:[§1](https://arxiv.org/html/2606.12916#S1.p2.1)\.
- W\. L\. Mock and N\. Shih \(1986\)Structure and selectivity in host\-guest complexes of cucurbituril\.Journal of Organic Chemistry\.Cited by:[Figure 4](https://arxiv.org/html/2606.12916#S5.F4)\.
- S\. Moghaddam, C\. Yang, M\. Rekharsky, Y\. H\. Ko, K\. Kim, Y\. Inoue, and M\. K\. Gilson \(2011\)New ultrahigh affinity host\-guest complexes of cucurbit\[7\]uril with bicyclo\[2\.2\.2\]octane and adamantane guests: thermodynamic analysis and evaluation of M2 affinity calculations\.Journal of the American Chemical Society\.Cited by:[Figure 4](https://arxiv.org/html/2606.12916#S5.F4)\.
- H\. S\. Muddana, A\. T\. Fenley, D\. L\. Mobley, and M\. K\. Gilson \(2014\)The SAMPL4 host\-guest blind prediction challenge: an overview\.Journal of Computer\-Aided Molecular Design\.Cited by:[§1](https://arxiv.org/html/2606.12916#S1.p5.2),[§5\.1](https://arxiv.org/html/2606.12916#S5.SS1.p1.3)\.
- A\. Y\. Ng, D\. Harada, and S\. J\. Russell \(1999\)Policy invariance under reward transformations: theory and application to reward shaping\.InInternational Conference on Machine Learning,Cited by:[§A\.4](https://arxiv.org/html/2606.12916#A1.SS4.p1.4),[§A\.5](https://arxiv.org/html/2606.12916#A1.SS5.p1.1)\.
- F\. Noé, S\. Olsson, J\. Köhler, and H\. Wu \(2019\)Boltzmann generators: sampling equilibrium states of many\-body systems with deep learning\.Science\.Cited by:[§A\.1](https://arxiv.org/html/2606.12916#A1.SS1.p1.1),[§2](https://arxiv.org/html/2606.12916#S2.p1.1)\.
- O\. O’Donoghue, A\. Shtedritski, J\. Ginger, R\. Abboud, A\. Ghareeb, and S\. Rodriques \(2023\)BioPlanner: automatic evaluation of LLMs on protocol planning in biology\.InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing,Cited by:[§A\.2](https://arxiv.org/html/2606.12916#A1.SS2.p1.1)\.
- J\. S\. Park, J\. C\. O’Brien, C\. J\. Cai, M\. R\. Morris, P\. Liang, and M\. S\. Bernstein \(2023\)Generative agents: interactive simulacra of human behavior\.InACM Symposium on User Interface Software and Technology,Cited by:[§A\.3](https://arxiv.org/html/2606.12916#A1.SS3.p1.1)\.
- D\. Pathak, P\. Agrawal, A\. A\. Efros, and T\. Darrell \(2017\)Curiosity\-driven exploration by self\-supervised prediction\.InInternational Conference on Machine Learning,Cited by:[§A\.5](https://arxiv.org/html/2606.12916#A1.SS5.p1.1)\.
- C\. Qian, W\. Liu, H\. Liu, N\. Chen, Y\. Dang, J\. Li, C\. Yang, W\. Chen, Y\. Su, X\. Cong, J\. Xu, D\. Li, Z\. Liu, and M\. Sun \(2024\)ChatDev: communicative agents for software development\.InAnnual Meeting of the Association for Computational Linguistics,Cited by:[§A\.3](https://arxiv.org/html/2606.12916#A1.SS3.p1.1),[§A\.6](https://arxiv.org/html/2606.12916#A1.SS6.p1.1)\.
- Y\. Qu, K\. Huang, M\. Yin, K\. Zhan, D\. Liu, D\. Yin, H\. C\. Cousins, W\. A\. Johnson, X\. Wang, M\. Shah, R\. B\. Altman, D\. Zhou, M\. Wang, and L\. Cong \(2024\)CRISPR\-GPT for agentic automation of gene\-editing experiments\.arXiv preprint arXiv:2404\.18021\.Cited by:[§A\.2](https://arxiv.org/html/2606.12916#A1.SS2.p1.1)\.
- M\. V\. Rekharsky, T\. Mori, C\. Yang, Y\. H\. Ko, N\. Selvapalam, H\. Kim, D\. Sobransingh, A\. E\. Kaifer, S\. Liu, L\. Isaacs, W\. Chen, S\. Moghaddam, M\. K\. Gilson, K\. Kim, and Y\. Inoue \(2007\)A synthetic host\-guest system achieves avidin\-biotin affinity by overcoming enthalpy\-entropy compensation\.Proceedings of the National Academy of Sciences\.Cited by:[Figure 4](https://arxiv.org/html/2606.12916#S5.F4)\.
- B\. Romera\-Paredes, M\. Barekatain, A\. Novikov, M\. Balog, M\. P\. Kumar, E\. Dupont, F\. J\. R\. Ruiz, J\. S\. Ellenberg, P\. Wang, O\. Fawzi, P\. Kohli, and A\. Fawzi \(2024\)Mathematical discoveries from program search with large language models\.Nature\.Cited by:[§A\.2](https://arxiv.org/html/2606.12916#A1.SS2.p1.1),[§2](https://arxiv.org/html/2606.12916#S2.p2.1)\.
- J\. Ross, B\. Belgodere, V\. Chenthamarakshan, I\. Padhi, Y\. Mroueh, and P\. Das \(2022\)Large\-scale chemical language representations capture molecular structure and properties\.Nature Machine Intelligence\.Cited by:[§1](https://arxiv.org/html/2606.12916#S1.p2.1)\.
- C\. E\. M\. Schindler, H\. Baumann, A\. Blum, D\. Bose, H\. Buchstaller, L\. Burgdorf, D\. Cappel, E\. Chekler, P\. Czodrowski, D\. Dorsch, M\. K\. I\. Eguida, B\. Follows, T\. Fuchss, U\. Grädler, J\. Gunera, T\. Johnson, C\. Jorand Lebrun, S\. Karra, M\. Klein, T\. Knehans, L\. Koetzner, M\. Krier, M\. Leiendecker, B\. Leuthner, L\. Li, I\. Mochalkin, D\. Musil, C\. Neagu, F\. Rippmann, K\. Schiemann, R\. Schulz, T\. Steinbrecher, E\. Tanzer, A\. Unzue Lopez, A\. Viacava Follis, A\. Wegener, and D\. Kuhn \(2020\)Large\-scale assessment of binding free energy calculations in active drug discovery projects\.Journal of Chemical Information and Modeling\.Cited by:[§1](https://arxiv.org/html/2606.12916#S1.p2.1)\.
- I\. Schlag, K\. Irie, and J\. Schmidhuber \(2021\)Linear transformers are secretly fast weight programmers\.InInternational Conference on Machine Learning,Cited by:[§4\.1](https://arxiv.org/html/2606.12916#S4.SS1.p1.2)\.
- S\. Schmidgall, Y\. Su, Z\. Wang, X\. Sun, J\. Wu, X\. Yu, J\. Liu, M\. Moor, Z\. Liu, and E\. Barsoum \(2025\)Agent laboratory: using LLM agents as research assistants\.InFindings of the Association for Computational Linguistics: EMNLP 2025,Cited by:[§A\.2](https://arxiv.org/html/2606.12916#A1.SS2.p1.1)\.
- J\. Schmidhuber \(1992\)Learning to control fast\-weight memories: an alternative to dynamic recurrent networks\.Neural Computation\.Cited by:[§4\.1](https://arxiv.org/html/2606.12916#S4.SS1.p1.2)\.
- J\. Schulman, F\. Wolski, P\. Dhariwal, A\. Radford, and O\. Klimov \(2017\)Proximal policy optimization algorithms\.arXiv preprint arXiv:1707\.06347\.Cited by:[§A\.4](https://arxiv.org/html/2606.12916#A1.SS4.p1.4)\.
- B\. Shahriari, K\. Swersky, Z\. Wang, R\. P\. Adams, and N\. de Freitas \(2016\)Taking the human out of the loop: a review of Bayesian optimization\.Proceedings of the IEEE\.Cited by:[§A\.4](https://arxiv.org/html/2606.12916#A1.SS4.p1.4)\.
- Z\. Shi, H\. A, Y\. Shao, D\. Huang, H\. An, C\. Xin, H\. Shen, Z\. Wang, Y\. Na, G\. Huang, and X\. Jing \(2026\)MDAgent2: large language model for code generation and knowledge Q&A in molecular dynamics\.arXiv preprint arXiv:2601\.02075\.Cited by:[§A\.2](https://arxiv.org/html/2606.12916#A1.SS2.p1.1)\.
- Z\. Shi, C\. Xin, T\. Huo, Y\. Jiang, B\. Wu, X\. Chen, W\. Qin, X\. Ma, G\. Huang, Z\. Wang, and X\. Jing \(2025\)A fine\-tuned large language model based molecular dynamics agent for code generation to obtain material thermodynamic parameters\.Scientific Reports\.Cited by:[§A\.2](https://arxiv.org/html/2606.12916#A1.SS2.p1.1),[§2](https://arxiv.org/html/2606.12916#S2.p2.1)\.
- N\. Shinn, F\. Cassano, E\. Berman, A\. Gopinath, K\. Narasimhan, and S\. Yao \(2023\)Reflexion: language agents with verbal reinforcement learning\.InAdvances in Neural Information Processing Systems,Cited by:[§A\.3](https://arxiv.org/html/2606.12916#A1.SS3.p1.1),[§1](https://arxiv.org/html/2606.12916#S1.p4.1),[§4\.1](https://arxiv.org/html/2606.12916#S4.SS1.p1.2),[Table 1](https://arxiv.org/html/2606.12916#S4.T1.10.10.2),[Table 1](https://arxiv.org/html/2606.12916#S4.T1.17.17.2),[Table 1](https://arxiv.org/html/2606.12916#S4.T1.25.25.2),[§5\.1](https://arxiv.org/html/2606.12916#S5.SS1.p2.1)\.
- M\. R\. Shirts and J\. D\. Chodera \(2008\)Statistically optimal analysis of samples from multiple equilibrium states\.The Journal of Chemical Physics\.Cited by:[§A\.1](https://arxiv.org/html/2606.12916#A1.SS1.p1.1),[§2](https://arxiv.org/html/2606.12916#S2.p1.1),[Table 2](https://arxiv.org/html/2606.12916#S5.T2.12.12.1.1.1)\.
- D\. R\. Slochower, N\. M\. Henriksen, L\. Wang, J\. D\. Chodera, D\. L\. Mobley, and M\. K\. Gilson \(2019\)Binding thermodynamics of host\-guest systems with SMIRNOFF99Frosst 1\.0\.5 from the Open Force Field Initiative\.Journal of Chemical Theory and Computation\.Cited by:[§5\.4](https://arxiv.org/html/2606.12916#S5.SS4.p1.1),[Table 2](https://arxiv.org/html/2606.12916#S5.T2)\.
- C\. Snell, J\. Lee, K\. Xu, and A\. Kumar \(2025\)Scaling LLM test\-time compute optimally can be more effective than scaling parameters for reasoning\.InInternational Conference on Learning Representations,Cited by:[§A\.5](https://arxiv.org/html/2606.12916#A1.SS5.p1.1)\.
- J\. Son, S\. Lee, and G\. Kim \(2025\)Distilling reinforcement learning algorithms for in\-context model\-based planning\.InInternational Conference on Learning Representations,Cited by:[§A\.3](https://arxiv.org/html/2606.12916#A1.SS3.p1.1)\.
- N\. J\. Szymanski, B\. Rendy, Y\. Fei, R\. E\. Kumar, T\. He, D\. Milsted, M\. J\. McDermott, M\. Gallant, E\. D\. Cubuk, A\. Merchant, H\. Kim, A\. Jain, C\. J\. Bartel, K\. Persson, Y\. Zeng, and G\. Ceder \(2023\)An autonomous laboratory for the accelerated synthesis of inorganic materials\.Nature\.Cited by:[§A\.2](https://arxiv.org/html/2606.12916#A1.SS2.p1.1)\.
- P\. Trirat, W\. Jeong, and S\. J\. Hwang \(2025\)AutoML\-Agent: a multi\-agent LLM framework for full\-pipeline AutoML\.InInternational Conference on Machine Learning,Cited by:[§A\.2](https://arxiv.org/html/2606.12916#A1.SS2.p1.1)\.
- J\. Uesato, N\. Kushman, R\. Kumar, F\. Song, N\. Siegel, L\. Wang, A\. Creswell, G\. Irving, and I\. Higgins \(2022\)Solving math word problems with process\- and outcome\-based feedback\.arXiv preprint arXiv:2211\.14275\.Cited by:[§A\.5](https://arxiv.org/html/2606.12916#A1.SS5.p1.1),[§1](https://arxiv.org/html/2606.12916#S1.p5.2)\.
- O\. T\. Unke, S\. Chmiela, H\. E\. Sauceda, M\. Gastegger, I\. Poltavsky, K\. T\. Schütt, A\. Tkatchenko, and K\. Müller \(2021\)Machine learning force fields\.Chemical Reviews\.Cited by:[§A\.1](https://arxiv.org/html/2606.12916#A1.SS1.p1.1),[§1](https://arxiv.org/html/2606.12916#S1.p1.1),[§2](https://arxiv.org/html/2606.12916#S2.p1.1)\.
- D\. van Tilborg, A\. Alenicheva, and F\. Grisoni \(2022\)Exposing the limitations of molecular machine learning with activity cliffs\.Journal of Chemical Information and Modeling\.Cited by:[§1](https://arxiv.org/html/2606.12916#S1.p2.1)\.
- A\. S\. Vezhnevets, S\. Osindero, T\. Schaul, N\. Heess, M\. Jaderberg, D\. Silver, and K\. Kavukcuoglu \(2017\)FeUdal networks for hierarchical reinforcement learning\.InInternational Conference on Machine Learning,Cited by:[§A\.5](https://arxiv.org/html/2606.12916#A1.SS5.p1.1)\.
- A\. Vriza, U\. Kornu, A\. Koneru, H\. Chan, and S\. K\. R\. S\. Sankaranarayanan \(2026\)Multi\-agentic AI framework for end\-to\-end atomistic simulations\.Digital Discovery\.Cited by:[§A\.2](https://arxiv.org/html/2606.12916#A1.SS2.p1.1)\.
- G\. Wang, Y\. Xie, Y\. Jiang, A\. Mandlekar, C\. Xiao, Y\. Zhu, L\. Fan, and A\. Anandkumar \(2024a\)Voyager: an open\-ended embodied agent with large language models\.Transactions on Machine Learning Research\.Cited by:[§A\.2](https://arxiv.org/html/2606.12916#A1.SS2.p1.1),[§A\.3](https://arxiv.org/html/2606.12916#A1.SS3.p1.1),[§1](https://arxiv.org/html/2606.12916#S1.p4.1),[§2](https://arxiv.org/html/2606.12916#S2.p2.1)\.
- L\. Wang, Y\. Wu, Y\. Deng, B\. Kim, L\. Pierce, G\. Krilov, D\. Lupyan, S\. Robinson, M\. K\. Dahlgren, J\. Greenwood, D\. L\. Romero, C\. Masse, J\. L\. Knight, T\. Steinbrecher, T\. Beuming, W\. Damm, E\. Harder, W\. Sherman, M\. Brewer, R\. Wester, M\. Murcko, L\. Frye, R\. Farid, T\. Lin, D\. L\. Mobley, W\. L\. Jorgensen, B\. J\. Berne, R\. A\. Friesner, and R\. Abel \(2015\)Accurate and reliable prediction of relative ligand binding potency in prospective drug discovery by way of a modern free\-energy calculation protocol and force field\.Journal of the American Chemical Society\.Cited by:[§A\.1](https://arxiv.org/html/2606.12916#A1.SS1.p1.1)\.
- P\. Wang, L\. Li, Z\. Shao, R\. Xu, D\. Dai, Y\. Li, D\. Chen, Y\. Wu, and Z\. Sui \(2024b\)Math\-shepherd: verify and reinforce LLMs step\-by\-step without human annotations\.InAnnual Meeting of the Association for Computational Linguistics,Cited by:[§A\.5](https://arxiv.org/html/2606.12916#A1.SS5.p1.1),[§1](https://arxiv.org/html/2606.12916#S1.p5.2)\.
- Z\. Wang, X\. Han, Q\. Yang, X\. Tang, F\. Wu, X\. Guo, W\. Sun, T\. Ma, P\. Lio, S\. Wang, C\. Zhang, and Y\. Ye \(2026a\)Molecular representations in implicit functional space via hyper\-networks\.arXiv preprint arXiv:2601\.22327\.Cited by:[§1](https://arxiv.org/html/2606.12916#S1.p2.1)\.
- Z\. Wang, F\. Wu, H\. Wang, X\. Tang, B\. Li, Z\. Yin, Y\. Ma, Y\. Li, W\. Sun, X\. Chen, and Y\. Ye \(2026b\)Why reasoning fails to plan: a planning\-centric analysis of long\-horizon decision making in LLM agents\.arXiv preprint arXiv:2601\.22311\.Cited by:[§1](https://arxiv.org/html/2606.12916#S1.p4.1)\.
- F\. Wu, W\. Xuan, H\. Qi, H\. Cao, H\. Chang, Z\. Zhou, H\. Zhao, M\. Jian, C\. Ma, Y\. Cheng, K\. Pang, X\. Tang, Z\. Wang, G\. Li, H\. Wang, K\. Ying, P\. Lu, C\. Im, S\. Han, P\. Xia, T\. Xu, Y\. Li, D\. Zhu, P\. Heng, N\. Yokoya, M\. Sugiyama, L\. E\. Li, J\. Leskovec, and Y\. Choi \(2026\)Proteo\-R1: reasoning foundation models for de novo protein design\.InInternational Conference on Machine Learning \(ICML\),Cited by:[§A\.1](https://arxiv.org/html/2606.12916#A1.SS1.p1.1)\.
- Q\. Wu, G\. Bansal, J\. Zhang, Y\. Wu, B\. Li, E\. Zhu, L\. Jiang, X\. Zhang, S\. Zhang, J\. Liu, A\. H\. Awadallah, R\. W\. White, D\. Burger, and C\. Wang \(2024\)AutoGen: enabling next\-gen LLM applications via multi\-agent conversation\.InConference on Language Modeling \(COLM\),Cited by:[§A\.3](https://arxiv.org/html/2606.12916#A1.SS3.p1.1),[§A\.6](https://arxiv.org/html/2606.12916#A1.SS6.p1.1)\.
- R\. Wu, X\. Wang, J\. Mei, P\. Cai, D\. Fu, C\. Yang, L\. Wen, X\. Yang, Y\. Shen, Y\. Wang, and B\. Shi \(2025\)EvolveR: self\-evolving LLM agents through an experience\-driven lifecycle\.arXiv preprint arXiv:2510\.16079\.Cited by:[§A\.3](https://arxiv.org/html/2606.12916#A1.SS3.p1.1)\.
- Z\. Wu, B\. Ramsundar, E\. N\. Feinberg, J\. Gomes, C\. Geniesse, A\. S\. Pappu, K\. Leswing, and V\. Pande \(2018\)MoleculeNet: a benchmark for molecular machine learning\.Chemical Science\.Cited by:[§1](https://arxiv.org/html/2606.12916#S1.p2.1)\.
- Z\. Xie, J\. Chen, L\. Chen, W\. Mao, J\. Xu, and L\. Kong \(2025\)Teaching language models to critique via reinforcement learning\.InInternational Conference on Machine Learning,Cited by:[§A\.3](https://arxiv.org/html/2606.12916#A1.SS3.p1.1)\.
- C\. Yang, X\. Wang, Y\. Lu, H\. Liu, Q\. V\. Le, D\. Zhou, and X\. Chen \(2024a\)Large language models as optimizers\.InThe Twelfth International Conference on Learning Representations \(ICLR\),Cited by:[§A\.3](https://arxiv.org/html/2606.12916#A1.SS3.p1.1)\.
- F\. Yang and J\. D\. Evans \(2026\)QUASAR: a universal autonomous system for atomistic simulation and a benchmark of its capabilities\.arXiv preprint arXiv:2602\.00185\.Cited by:[§A\.2](https://arxiv.org/html/2606.12916#A1.SS2.p1.1)\.
- J\. Yang, C\. E\. Jimenez, A\. Wettig, K\. Lieret, S\. Yao, K\. Narasimhan, and O\. Press \(2024b\)SWE\-agent: agent\-computer interfaces enable automated software engineering\.InAdvances in Neural Information Processing Systems,Cited by:[§A\.2](https://arxiv.org/html/2606.12916#A1.SS2.p1.1)\.
- K\. Yang, K\. Swanson, W\. Jin, C\. Coley, P\. Eiden, H\. Gao, A\. Guzman\-Perez, T\. Hopper, B\. Kelley, M\. Mathea, A\. Palmer, V\. Settels, T\. Jaakkola, K\. Jensen, and R\. Barzilay \(2019\)Analyzing learned molecular representations for property prediction\.Journal of Chemical Information and Modeling\.Cited by:[§1](https://arxiv.org/html/2606.12916#S1.p2.1)\.
- S\. Yao, J\. Zhao, D\. Yu, N\. Du, I\. Shafran, K\. Narasimhan, and Y\. Cao \(2023\)ReAct: synergizing reasoning and acting in language models\.InInternational Conference on Learning Representations,Cited by:[§A\.3](https://arxiv.org/html/2606.12916#A1.SS3.p1.1),[§1](https://arxiv.org/html/2606.12916#S1.p3.1),[Table 1](https://arxiv.org/html/2606.12916#S4.T1.16.16.2),[Table 1](https://arxiv.org/html/2606.12916#S4.T1.24.24.2),[Table 1](https://arxiv.org/html/2606.12916#S4.T1.9.9.2),[§5\.1](https://arxiv.org/html/2606.12916#S5.SS1.p2.1)\.
- X\. Ye, Y\. Mao, J\. Zhang, Y\. Liu, L\. Hao, F\. Wu, Z\. Li, Y\. Liao, Z\. Wang, Y\. Wu, Z\. Liu, Z\. Yin, L\. Yuan, P\. Torr, H\. Sun, X\. Zeng, M\. Wang, L\. Cong, S\. Gao, and X\. Tang \(2026\)LatentChem: from textual CoT to latent thinking in chemical reasoning\.InInternational Conference on Machine Learning \(ICML\),Cited by:[§1](https://arxiv.org/html/2606.12916#S1.p2.1)\.
- J\. Yin, N\. M\. Henriksen, D\. R\. Slochower, M\. R\. Shirts, M\. W\. Chiu, D\. L\. Mobley, and M\. K\. Gilson \(2017\)Overview of the SAMPL5 host\-guest challenge: are we doing better?\.Journal of Computer\-Aided Molecular Design\.Cited by:[§1](https://arxiv.org/html/2606.12916#S1.p5.2),[§5\.2](https://arxiv.org/html/2606.12916#S5.SS2.p1.8)\.
- M\. Yuksekgonul, F\. Bianchi, J\. Boen, S\. Liu, P\. Lu, Z\. Huang, C\. Guestrin, and J\. Zou \(2025\)Optimizing generative AI by backpropagating language model feedback\.Nature\.Cited by:[§A\.3](https://arxiv.org/html/2606.12916#A1.SS3.p1.1)\.
- B\. Zdrazil, E\. Felix, F\. Hunter, E\. J\. Manners, J\. Blackshaw, S\. Corbett, M\. de Veij, H\. Ioannidis, D\. M\. Lopez, J\. F\. Mosquera, M\. P\. Magariños, N\. Bosc, R\. Arcila, T\. Kizilören, A\. Gaulton, A\. P\. Bento, M\. F\. Adasme, P\. Monecke, G\. A\. Landrum, and A\. R\. Leach \(2024\)The ChEMBL database in 2023: a drug discovery platform spanning multiple bioactivity data types and time periods\.Nucleic Acids Research\.Cited by:[§5\.5](https://arxiv.org/html/2606.12916#S5.SS5.p1.13)\.
- D\. Zhang, S\. Zhoubian, Z\. Hu, Y\. Yue, Y\. Dong, and J\. Tang \(2024\)ReST\-MCTS\*: LLM self\-training via process reward guided tree search\.InAdvances in Neural Information Processing Systems,Cited by:[§A\.5](https://arxiv.org/html/2606.12916#A1.SS5.p1.1)\.
- Z\. Zhang, C\. Zheng, Y\. Wu, B\. Zhang, R\. Lin, B\. Yu, D\. Liu, J\. Zhou, and J\. Lin \(2025\)The lessons of developing process reward models in mathematical reasoning\.InFindings of the Association for Computational Linguistics: ACL 2025,Cited by:[§A\.5](https://arxiv.org/html/2606.12916#A1.SS5.p1.1)\.
- A\. Zhao, A\. Chandrasekhar, and A\. B\. Farimani \(2026\)PolyJarvis: LLM agent for autonomous polymer MD simulations\.arXiv preprint arXiv:2604\.02537\.Cited by:[§A\.2](https://arxiv.org/html/2606.12916#A1.SS2.p1.1)\.
- A\. Zhao, D\. Huang, Q\. Xu, M\. Lin, Y\. Liu, and G\. Huang \(2024\)ExpeL: LLM agents are experiential learners\.InProceedings of the AAAI Conference on Artificial Intelligence,Cited by:[§A\.3](https://arxiv.org/html/2606.12916#A1.SS3.p1.1),[§1](https://arxiv.org/html/2606.12916#S1.p3.1)\.
- C\. Zheng, Z\. Zhang, B\. Zhang, R\. Lin, K\. Lu, B\. Yu, D\. Liu, J\. Zhou, and J\. Lin \(2025\)ProcessBench: identifying process errors in mathematical reasoning\.InAnnual Meeting of the Association for Computational Linguistics,Cited by:[§A\.5](https://arxiv.org/html/2606.12916#A1.SS5.p1.1)\.
- L\. Zheng, W\. Chiang, Y\. Sheng, S\. Zhuang, Z\. Wu, Y\. Zhuang, Z\. Lin, Z\. Li, D\. Li, E\. P\. Xing, H\. Zhang, J\. E\. Gonzalez, and I\. Stoica \(2023\)Judging LLM\-as\-a\-judge with MT\-Bench and Chatbot Arena\.InAdvances in Neural Information Processing Systems \(NeurIPS\),Cited by:[§A\.6](https://arxiv.org/html/2606.12916#A1.SS6.p1.1)\.

## Appendix A Comprehensive Related Work

### A\.1 Molecular Dynamics

MDForge sits atop the established methodological stack of physics\-based MD rather than competing with any of its parts\. The binding free\-energy task we target is computed by the alchemical FEP/TI family with BAR/MBAR estimators \(Bennett, [1976](https://arxiv.org/html/2606.12916#bib.bib95); Shirts and Chodera, [2008](https://arxiv.org/html/2606.12916#bib.bib96); Mey et al\., [2020](https://arxiv.org/html/2606.12916#bib.bib77)\); absolute free\-energy workflows additionally rely on standard\-state restraint corrections \(Boresch et al\., [2003](https://arxiv.org/html/2606.12916#bib.bib97); Gilson et al\., [1997](https://arxiv.org/html/2606.12916#bib.bib98)\), while the underlying alchemical machinery has been benchmarked extensively in the relative regime \(Wang et al\., [2015](https://arxiv.org/html/2606.12916#bib.bib99); Cournia et al\., [2017](https://arxiv.org/html/2606.12916#bib.bib62)\)\. The executable pipelines our agent emits target mature engines \(OpenMM, GROMACS, AMBER\) \(Eastman et al\., [2017](https://arxiv.org/html/2606.12916#bib.bib100); Abraham et al\., [2015](https://arxiv.org/html/2606.12916#bib.bib101); Case et al\., [2023](https://arxiv.org/html/2606.12916#bib.bib102)\) and biomolecular force\-field families \(AMBER, GAFF/OpenFF, TIP3P/OPC water\), which form the action vocabulary the LLM composes pipelines from rather than targets it tries to improve\. A parallel line of work proposes neural surrogates for parts of the pipeline: ML force fields trained to replace classical potentials \(Behler, [2021](https://arxiv.org/html/2606.12916#bib.bib75); Unke et al\., [2021](https://arxiv.org/html/2606.12916#bib.bib76)\), end\-to\-end structure predictors \(Jumper et al\., [2021](https://arxiv.org/html/2606.12916#bib.bib3); Wu et al\., [2026](https://arxiv.org/html/2606.12916#bib.bib120)\), and neural samplers that draw equilibrium configurations directly \(Noé et al\., [2019](https://arxiv.org/html/2606.12916#bib.bib103)\)\. Each replaces a single slice \(a force, a static structure, an equilibrium sample\), but none yields the staged, diagnostic\-emitting trajectory whose design MDForge automates\. Earlier non\-LLM frameworks \(BioSimSpace, OpenFE, perses\) \(Hedges et al\., [2019](https://arxiv.org/html/2606.12916#bib.bib104)\) also templatize parts of this workflow; MDForge instead emits the workflow itself as code, so the design space is the program\-synthesis space rather than the parameter space of a fixed template\.
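For orientation, a binding free energy and an association constant are two views of the same quantity, related by the textbook expression $\Delta G^\circ = -RT\ln(K_a c^\circ)$ with standard concentration $c^\circ = 1$ M. The minimal sketch below applies that relation to the $K_a \approx 8\times 10^{12}$ M$^{-1}$ reported in the abstract. The temperature (298.15 K) and the function name `ka_to_dg` are assumptions made for illustration; they are not taken from the paper or its code.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)


def ka_to_dg(ka_per_molar: float, temperature_k: float = 298.15) -> float:
    """Standard binding free energy (kcal/mol) from an association constant (M^-1).

    Uses dG = -RT ln(Ka * c0) with c0 = 1 M.
    The default temperature is an assumption, not a value from the paper.
    """
    if ka_per_molar <= 0:
        raise ValueError("Ka must be positive")
    c0 = 1.0  # standard concentration in mol/L
    return -R_KCAL * temperature_k * math.log(ka_per_molar * c0)


dg = ka_to_dg(8e12)
print(f"{dg:.1f} kcal/mol")  # roughly -17.6 kcal/mol at 298.15 K
```

Under these assumptions a picomolar binder corresponds to a binding free energy near -17.6 kcal/mol, which gives a sense of the magnitude the alchemical estimators have to resolve.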

### A\.2 Autonomous Science Agents

LLM\-driven scientific agents integrate literature search, hypothesis proposal, code synthesis, and outer\-loop evaluation into runnable discovery pipelines \(Boiko et al\., [2023](https://arxiv.org/html/2606.12916#bib.bib37); Romera\-Paredes et al\., [2024](https://arxiv.org/html/2606.12916#bib.bib38); Lu et al\., [2024](https://arxiv.org/html/2606.12916#bib.bib39); Bran et al\., [2024](https://arxiv.org/html/2606.12916#bib.bib52); Baek et al\., [2025](https://arxiv.org/html/2606.12916#bib.bib61); Schmidgall et al\., [2025](https://arxiv.org/html/2606.12916#bib.bib60); Chen et al\., [2026](https://arxiv.org/html/2606.12916#bib.bib40); Ma et al\., [2026a](https://arxiv.org/html/2606.12916#bib.bib122)\), with parallel instantiations in materials discovery and autonomous laboratories \(Merchant et al\., [2023a](https://arxiv.org/html/2606.12916#bib.bib53); Szymanski et al\., [2023](https://arxiv.org/html/2606.12916#bib.bib54)\), biological experiment and protocol planning \(Qu et al\., [2024](https://arxiv.org/html/2606.12916#bib.bib55); O’Donoghue et al\., [2023](https://arxiv.org/html/2606.12916#bib.bib56)\), and software\- or ML\-engineering pipeline automation \(Yang et al\., [2024b](https://arxiv.org/html/2606.12916#bib.bib57); Trirat et al\., [2025](https://arxiv.org/html/2606.12916#bib.bib58); Chan et al\., [2025](https://arxiv.org/html/2606.12916#bib.bib59)\)\. Closest to our setting, MD\-specific agents have converged on a *tool\-calling* pattern in which an LLM orchestrates a fixed library of simulation engines, force fields, and analysis libraries into an executable workflow\. MDCrow \(Campbell et al\., [2025](https://arxiv.org/html/2606.12916#bib.bib43)\) provides a general\-purpose toolset over force\-field setup, simulation, and trajectory analysis, exposed to an LLM through LangChain\-style tool calls\. MDAgent \(Ma et al\., [2026b](https://arxiv.org/html/2606.12916#bib.bib47)\) extends the pattern with a case\-based skill\-and\-memory module that retrieves prior task knowledge across trials, the most ambitious recent attempt at inter\-trial improvement within the tool\-calling paradigm\. Similar tool\-calling patterns extend to binding free\-energy workflows \(Guilbert et al\., [2025](https://arxiv.org/html/2606.12916#bib.bib41)\), polymer and topological MD \(Zhao et al\., [2026](https://arxiv.org/html/2606.12916#bib.bib42); Ding et al\., [2025](https://arxiv.org/html/2606.12916#bib.bib48)\), NAMD\-based protein simulations \(Chandrasekhar and Farimani, [2025](https://arxiv.org/html/2606.12916#bib.bib44)\), alloy design \(Ghafarollahi and Buehler, [2025](https://arxiv.org/html/2606.12916#bib.bib50)\), broader atomistic settings \(Vriza et al\., [2026](https://arxiv.org/html/2606.12916#bib.bib49); Yang and Evans, [2026](https://arxiv.org/html/2606.12916#bib.bib51)\), and to fine\-tuned LLM variants that emit MD scripts directly \(Shi et al\., [2025](https://arxiv.org/html/2606.12916#bib.bib45), [2026](https://arxiv.org/html/2606.12916#bib.bib46)\)\. In contrast, MDForge treats MD pipeline design as *open\-ended code generation*, placing it in the lineage of program\-synthesis agents that emit arbitrary executable code rather than select from a fixed tool library \(Wang et al\., [2024a](https://arxiv.org/html/2606.12916#bib.bib11); Romera\-Paredes et al\., [2024](https://arxiv.org/html/2606.12916#bib.bib38)\); unlike these, MDForge operates in a regime whose only supervisory signal is sparse \(one terminal reward per full pipeline run\) and expensive \(GPU\-hours of MD execution per trial\)\.

### A\.3 Verbal Reinforcement Learning

In\-context learning enables an LLM agent to reshape its behavior online by manipulating context rather than parameters, requiring no gradient update\. Reflexion \(Shinn et al\., [2023](https://arxiv.org/html/2606.12916#bib.bib8)\) formalized this as *verbal reinforcement learning*: an LLM agent attempts a trial, receives a textual outcome label, generates a natural\-language reflection, and consumes that reflection as additional context on the next trial; recent work further formalizes this verbal\-feedback channel without scalar rewards \(Luo et al\., [2025](https://arxiv.org/html/2606.12916#bib.bib63)\)\. Subsequent work extends this paradigm with reasoning\-action interleaving \(Yao et al\., [2023](https://arxiv.org/html/2606.12916#bib.bib9); Madaan et al\., [2023](https://arxiv.org/html/2606.12916#bib.bib10)\), persistent skill libraries \(Wang et al\., [2024a](https://arxiv.org/html/2606.12916#bib.bib11); Wu et al\., [2025](https://arxiv.org/html/2606.12916#bib.bib65)\), experiential memory \(Zhao et al\., [2024](https://arxiv.org/html/2606.12916#bib.bib12); Park et al\., [2023](https://arxiv.org/html/2606.12916#bib.bib13); Liu et al\., [2025](https://arxiv.org/html/2606.12916#bib.bib64)\), multi\-agent coordination \(Wu et al\., [2024](https://arxiv.org/html/2606.12916#bib.bib14); Du et al\., [2024](https://arxiv.org/html/2606.12916#bib.bib15); Qian et al\., [2024](https://arxiv.org/html/2606.12916#bib.bib16)\), in\-context RL \(Laskin et al\., [2023](https://arxiv.org/html/2606.12916#bib.bib17); Chen et al\., [2021](https://arxiv.org/html/2606.12916#bib.bib18); Son et al\., [2025](https://arxiv.org/html/2606.12916#bib.bib67)\), textual optimization that treats verbal feedback as a gradient\-like signal over prompts or programs \(Yang et al\., [2024a](https://arxiv.org/html/2606.12916#bib.bib88); Yuksekgonul et al\., [2025](https://arxiv.org/html/2606.12916#bib.bib89)\), and LLM\-generated dense reward or critique \(Ma et al\., [2024](https://arxiv.org/html/2606.12916#bib.bib19); Chen et al\., [2024](https://arxiv.org/html/2606.12916#bib.bib20); Xie et al\., [2025](https://arxiv.org/html/2606.12916#bib.bib66)\)\. However, all these methods rely on a supervisory signal that is both *rich* and *cheap*, and they are typically validated on benchmarks where per\-trial cost is seconds to minutes \(HotpotQA, AlfWorld, code unit tests\)\. We extend verbal RL to the opposite regime: each trial costs GPU\-hours of MD execution and yields only a single scalar at horizon end\.

### A\.4 Why Verbal RL for MDForge

Because $\theta$ is fixed, all learning is carried by the context rewrite $\mathcal{C}_{t}\mapsto\mathcal{C}_{t+1}$; the design of $\mathcal{C}_{t}$ is therefore the operative learning rule\. No other update rule fits the regime: classical sparse\-reward methods \(reward shaping \(Ng et al\., [1999](https://arxiv.org/html/2606.12916#bib.bib23)\), Bayesian optimization \(Shahriari et al\., [2016](https://arxiv.org/html/2606.12916#bib.bib25)\), hindsight relabeling \(Andrychowicz et al\., [2017](https://arxiv.org/html/2606.12916#bib.bib24)\)\) presume a feature space that $\Pi$ does not admit; gradient\-based RL \(Schulman et al\., [2017](https://arxiv.org/html/2606.12916#bib.bib105)\) demands orders of magnitude more rollouts; static few\-shot prompting \(Brown et al\., [2020](https://arxiv.org/html/2606.12916#bib.bib106)\) absorbs no trial signal at all\. Verbal RL alone \(i\) treats the trial signal as text, so heterogeneous per\-stage diagnostics fold back in without a fixed feature representation, \(ii\) keeps the policy a frozen LLM, so per\-trial updates require no parameter updates, and \(iii\) keeps the action space open\-ended\.
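The three properties above can be illustrated with a toy loop in which the policy parameters never change and the only thing that evolves is the context. This is a minimal sketch under stated assumptions, not the MDForge implementation: `FrozenPolicy`, `Trial`, `rewrite_context`, `run_verbal_rl`, and `fake_simulator` are hypothetical names, and the "LLM" is a stub that simply reports how much context it has seen.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class FrozenPolicy:
    """Stand-in for a frozen LLM: theta is never updated between trials."""
    theta: tuple = (0.1, 0.2)  # placeholder parameters (hypothetical)

    def act(self, context: list[str]) -> str:
        # A real system would prompt an LLM here; the stub only encodes context size.
        return f"pipeline_v{len(context)}"


@dataclass
class Trial:
    pipeline: str
    diagnostics: dict[str, str]  # free-form per-stage text, no fixed feature schema


def rewrite_context(context: list[str], trial: Trial) -> list[str]:
    """Update rule C_t -> C_{t+1}: fold the trial's textual feedback into the context."""
    notes = [f"[{stage}] {msg}" for stage, msg in trial.diagnostics.items()]
    return context + [f"{trial.pipeline}: " + "; ".join(notes)]


def run_verbal_rl(policy: FrozenPolicy,
                  simulate: Callable[[str], dict[str, str]],
                  n_trials: int) -> list[str]:
    context: list[str] = []
    for _ in range(n_trials):
        pipeline = policy.act(context)             # action depends only on the context
        trial = Trial(pipeline, simulate(pipeline))
        context = rewrite_context(context, trial)  # all "learning" happens here
    return context


def fake_simulator(pipeline: str) -> dict[str, str]:
    # Hypothetical diagnostics; a real run would cost GPU-hours per call.
    return {"equilibration": "density drifted", "analysis": f"{pipeline} gave a finite dG"}


policy = FrozenPolicy()
final_context = run_verbal_rl(policy, fake_simulator, n_trials=3)
print(len(final_context), policy.theta)
```

Because the diagnostics are plain strings keyed by stage name, adding a new stage or a new kind of message requires no change to the update rule, which is the point of property (i).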

### A\.5 Process Supervision

Process supervision densifies a sparse outcome signal by attaching intermediate scalar rewards to individual reasoning or decision steps\. This idea spans modern language\-model verification and test\-time search \(Uesato et al\., [2022](https://arxiv.org/html/2606.12916#bib.bib26); Cobbe et al\., [2021](https://arxiv.org/html/2606.12916#bib.bib27); Lightman et al\., [2024](https://arxiv.org/html/2606.12916#bib.bib21); Wang et al\., [2024b](https://arxiv.org/html/2606.12916#bib.bib28); Luo et al\., [2024](https://arxiv.org/html/2606.12916#bib.bib29); Zhang et al\., [2024](https://arxiv.org/html/2606.12916#bib.bib30); DeepSeek\-AI, [2025](https://arxiv.org/html/2606.12916#bib.bib22); Snell et al\., [2025](https://arxiv.org/html/2606.12916#bib.bib36); Zhang et al\., [2025](https://arxiv.org/html/2606.12916#bib.bib68); Khalifa et al\., [2025](https://arxiv.org/html/2606.12916#bib.bib69); Zheng et al\., [2025](https://arxiv.org/html/2606.12916#bib.bib70)\), as well as classical reinforcement\-learning densification \(Ng et al\., [1999](https://arxiv.org/html/2606.12916#bib.bib23); Andrychowicz et al\., [2017](https://arxiv.org/html/2606.12916#bib.bib24); Arjona\-Medina et al\., [2019](https://arxiv.org/html/2606.12916#bib.bib32); Vezhnevets et al\., [2017](https://arxiv.org/html/2606.12916#bib.bib33); Pathak et al\., [2017](https://arxiv.org/html/2606.12916#bib.bib34); Burda et al\., [2019](https://arxiv.org/html/2606.12916#bib.bib35); Christiano et al\., [2017](https://arxiv.org/html/2606.12916#bib.bib31)\)\. However, all these methods assume scalar feedback consumed via gradient\-based parameter updates\. We extend process supervision to the verbal, in\-context regime: the densified signal is typed natural\-language critique consumed without any parameter update\.

### A\.6 Multi\-Agent Debate

A line of work casts inference\-time reasoning as deliberation between multiple LLM instances, either as peers exchanging arguments to converge on a more reliable answer \(Du et al\., [2024](https://arxiv.org/html/2606.12916#bib.bib15); Liang et al\., [2024](https://arxiv.org/html/2606.12916#bib.bib90)\) or as evaluators substituting for human judges \(Zheng et al\., [2023](https://arxiv.org/html/2606.12916#bib.bib91); Bai et al\., [2022](https://arxiv.org/html/2606.12916#bib.bib93); Lee et al\., [2024](https://arxiv.org/html/2606.12916#bib.bib92)\), with broader role\-decomposition extending to multi\-agent software\- and task\-execution frameworks \(Wu et al\., [2024](https://arxiv.org/html/2606.12916#bib.bib14); Qian et al\., [2024](https://arxiv.org/html/2606.12916#bib.bib16); Li et al\., [2023](https://arxiv.org/html/2606.12916#bib.bib94); Ma et al\., [2025](https://arxiv.org/html/2606.12916#bib.bib121)\)\. These methods treat experts as interchangeable reasoners differentiated only by prompt persona, and the quantity they produce is a single converged judgment over a shared query\. The PRISM panel inside MDForge departs on both counts: each expert is anchored to a fixed physics subsystem \(force field, sampling, analysis\) with non\-overlapping jurisdiction, and the output is a typed, subsystem\-attributable critique that names which part of the pipeline to edit rather than a consensus verdict\. In addition, a slow cross\-task loop reweights per\-expert reputations according to the consistency between their pre\-trial and post\-execution assessments\.
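To make the contrast with consensus-style debate concrete, the sketch below shows one way a typed, subsystem-attributable critique could be represented and grouped. It is illustrative only: the `Critique` fields, the `"pass"`/`"fail"` label set, and in particular the exponential-moving-average rule in `update_reputation` are assumptions introduced here, not the paper's data model or reputation update.

```python
from collections import defaultdict
from dataclasses import dataclass

SUBSYSTEMS = ("force_field", "sampling", "analysis")  # one jurisdiction per expert


@dataclass(frozen=True)
class Critique:
    expert: str     # who issued the critique
    subsystem: str  # the only pipeline part this expert may comment on
    verdict: str    # "pass" or "fail" (hypothetical label set)
    message: str    # natural-language explanation


def aggregate(critiques: list[Critique]) -> dict[str, list[Critique]]:
    """Group critiques by subsystem instead of collapsing them into one verdict."""
    by_subsystem: dict[str, list[Critique]] = defaultdict(list)
    for c in critiques:
        if c.subsystem not in SUBSYSTEMS:
            raise ValueError(f"unknown subsystem: {c.subsystem}")
        by_subsystem[c.subsystem].append(c)
    return dict(by_subsystem)


def update_reputation(reputation: float, pre: Critique, post: Critique,
                      rate: float = 0.1) -> float:
    """Hypothetical rule: move reputation toward 1 when the expert's pre-trial and
    post-execution verdicts agree, and toward 0 when they disagree."""
    target = 1.0 if pre.verdict == post.verdict else 0.0
    return (1.0 - rate) * reputation + rate * target


pre = Critique("e_FF", "force_field", "pass", "GAFF charges look reasonable")
post = Critique("e_FF", "force_field", "fail", "host distorted during equilibration")
grouped = aggregate([pre, Critique("e_Samp", "sampling", "fail", "too few lambda windows")])
print(sorted(grouped), update_reputation(0.5, pre, post))
```

Keeping the subsystem as an explicit field is what lets a downstream code agent map each critique to the part of the pipeline it should edit.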

## Appendix B Method Complement

This appendix complements the high\-level description of MDForge in §[4](https://arxiv.org/html/2606.12916#S4) with the three technical pieces a reader needs to reproduce the system: the full PRISM pseudocode, a per\-agent summary of the prompts that drive every LLM call together with the debate protocol and aggregator, and the reputation update\. Verbatim prompts are released with the code\.

### B\.1 PRISM Pseudocode

See Algorithm [1](https://arxiv.org/html/2606.12916#alg1)\.

**Algorithm 1** MDForge's verbal RL update.

1. **Input:** task $T$ (host + guest set); expert panel $E=\{e_{\mathrm{FF}},e_{\mathrm{Samp}},e_{\mathrm{Anal}}\}$; successful-trial budget $N$; failed-revision cap $M$; $K=4$ stages
2. $d_{\text{rec}}\leftarrow\textsc{DesignDiscussion}(E,T)$ $\triangleright$ Phase 0: one-off, web-search enabled
3. $\pi\leftarrow\textsc{CodeAgent.Propose}(T,d_{\text{rec}})$
4. $n\leftarrow 0$; $m\leftarrow 0$; $H\leftarrow\emptyset$; $\pi^{*}\leftarrow\bot$
5. **while** $n<N$ **and** $m<M$ **and not** converged **do**
6. $\triangleright$ *Phase 1: static and mechanical screening*
7. $c_{L1}\leftarrow\textsc{Layer1.Verify}(\pi)$
8. **if** $c_{L1}.\text{label}=\textsc{Fail}$ **then**
9. $m\leftarrow m+1$; $\pi\leftarrow\textsc{CodeAgent.Revise}(\pi,[c_{L1}])$; **continue** $\triangleright$ bypass panel
10. **end if**
11. $\pi\leftarrow\textsc{Engineer.Debug}(\pi,T)$ $\triangleright$ mechanical fixes only; no methodological changes
12. $\triangleright$ *Phase 2: pre-execution verbal screening*
13. $(c_{\text{pre}},S_{\text{pre}})\leftarrow\textsc{Aggregate}(\{e_i.\textsc{PreEval}(\pi)\}_i)$ $\triangleright$ advisory
14. $\triangleright$ *Phase 3: engineer-led execution*
15. $D\leftarrow\textsc{Engineer.Run}(\pi,T)$ $\triangleright$ debug + run stages $1..K$ in sandbox
16. **if** any required stage fails or no finite $\hat{\Delta G}$ is produced **then**
17. $m\leftarrow m+1$; $\pi\leftarrow\textsc{CodeAgent.Revise}(\pi,[c_{L1},c_{\text{pre}},D])$; **continue**
18. **end if**
19. $B\leftarrow\textsc{MultiMoleculeBenchmark}(\pi,T)$ $\triangleright$ parallel over guest set
20. $(c_{\text{post}},S_{\text{post}})\leftarrow\textsc{Aggregate}(\{e_i.\textsc{PostEval}(\pi,B)\}_i)$
21. $\triangleright$ *Phase 4: sparse outcome reward*
22. $r^{*}\leftarrow-\,\textsc{MAE}(B)$; $n\leftarrow n+1$; $m\leftarrow 0$
23. **if** $c_{\text{post}}.\text{label}=\textsc{Pass}$ **and** $S_{\text{post}}\geq\theta_{\text{pass}}$ **then**
24. $\pi^{*}\leftarrow\pi$; **break**
25. **end if**
26. $\triangleright$ *Phase 5: in-context Code agent update*
27. $K_t\leftarrow\{c_{L1},c_{\text{pre}},c_{\text{post}}\}$; $H\leftarrow H\cup\{D,B,r^{*}\}$
28. $\mathcal{C}_t\leftarrow\{T,\,\pi,\,K_t,\,H\}$ $\triangleright$ $K_t$ = critique set; $H$ = prior trial summaries
29. $\pi\leftarrow\textsc{CodeAgent.Revise}(\pi,\mathcal{C}_t)$
30. $\triangleright$ *Phase 6: per-task slow-loop reputation*
31. $\rho\leftarrow\textsc{UpdateReputations}(\rho,c_{\text{pre}}.\text{votes},c_{\text{post}}.\text{votes})$
32. **end while**
33. **return** $\pi^{*}$ if converged, else last runnable $\pi$
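The control flow of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration, not the released implementation: the `Agents` bundle, the `Critique` container, the `(ok, diagnostics)` return shape of `engineer_run`, and the default values `M=3` and `theta_pass=0.7` are all assumptions made for the example; only the counter logic ($n$, $m$), the phase order, and the return rule follow the pseudocode.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Critique:
    label: str                      # "pass" | "uncertain" | "fail"
    score: float = 0.0              # aggregated panel score S (unused by Layer-1)
    votes: list = field(default_factory=list)


@dataclass
class Agents:
    """Hypothetical bundle of callables standing in for the LLM-backed roles."""
    design_discussion: Callable     # task -> design recommendation
    propose: Callable               # (task, d_rec) -> pipeline
    revise: Callable                # (pipeline, context) -> pipeline
    layer1_verify: Callable         # pipeline -> Critique
    engineer_debug: Callable        # (pipeline, task) -> pipeline
    pre_eval: Callable              # pipeline -> Critique
    engineer_run: Callable          # (pipeline, task) -> (ok, diagnostics)
    benchmark: Callable             # (pipeline, task) -> [(predicted, experimental)]
    post_eval: Callable             # (pipeline, benchmark) -> Critique
    update_reputations: Callable    # (pre_votes, post_votes) -> None


def mae(pairs):
    """Mean absolute error over (predicted, experimental) pairs."""
    return sum(abs(p - e) for p, e in pairs) / len(pairs)


def verbal_rl(task, a, N=5, M=3, theta_pass=0.7):
    d_rec = a.design_discussion(task)                 # Phase 0
    pi = a.propose(task, d_rec)
    n = m = 0
    history, best, last_runnable = [], None, None
    while n < N and m < M:
        c_l1 = a.layer1_verify(pi)                    # Phase 1
        if c_l1.label == "fail":
            m += 1
            pi = a.revise(pi, [c_l1])                 # bypass the panel
            continue
        pi = a.engineer_debug(pi, task)
        c_pre = a.pre_eval(pi)                        # Phase 2 (advisory)
        ok, diag = a.engineer_run(pi, task)           # Phase 3
        if not ok:
            m += 1
            pi = a.revise(pi, [c_l1, c_pre, diag])
            continue
        last_runnable = pi
        bench = a.benchmark(pi, task)
        c_post = a.post_eval(pi, bench)
        reward = -mae(bench)                          # Phase 4
        n, m = n + 1, 0                               # a successful trial resets m
        if c_post.label == "pass" and c_post.score >= theta_pass:
            best = pi
            break
        history.append((diag, bench, reward))         # Phase 5
        context = {"task": task, "pipeline": pi,
                   "critiques": [c_l1, c_pre, c_post], "history": history}
        pi = a.revise(pi, context)
        a.update_reputations(c_pre.votes, c_post.votes)   # Phase 6
    return best if best is not None else last_runnable
```

Note that the failure counter `m` is reset only after a successful trial, so the cap $M$ bounds consecutive failed revisions rather than the total number of failures.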

### B.2 Multi-Agent Debate Protocol

We present the debate protocol and aggregator that turn per-expert votes into a single typed critique. The prompt of each expert agent is deferred to Appendix [D](https://arxiv.org/html/2606.12916#A4). PRISM uses $J=3$ specialists with fixed jurisdictions: *Force Field*, *Sampling*, and *Analysis*. The Analysis expert also covers restraints, standard-state correction, and thermodynamic-cycle algebra.

#### Shared expert output contract\.

Every expert returns the same JSON schema across design recommendation, pre-execution review, and benchmark post-execution review: a label $\ell_i\in\{\textsc{Pass},\textsc{Uncertain},\textsc{Fail}\}$, confidence $\kappa_i\in[0,1]$, a load-bearing *strategic\_insight* field, a list of severity-scored concerns, and a short reasoning synthesis. Cross-domain commentary is allowed only when another subsystem directly affects the expert's jurisdiction.
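A minimal sketch of how a harness might parse and validate one expert vote against this contract is given below. The top-level field names follow the contract as summarized in Appendix D; the keys inside each concern (`description`, `severity`, `suggested_focus`) and the helper `parse_vote` are assumptions for illustration, since the exact schema is only given in the released prompts.

```python
import json
from dataclasses import dataclass, field

LABELS = {"pass", "uncertain", "fail"}


@dataclass
class Concern:
    description: str            # hypothetical key name
    severity: float             # hypothetical key name and scale
    suggested_focus: str = ""   # hypothetical key name


@dataclass
class ExpertVote:
    label: str
    confidence: float
    strategic_insight: str
    concerns: list = field(default_factory=list)
    reasoning: str = ""

    def __post_init__(self):
        # Reject votes that fall outside the shared contract.
        if self.label not in LABELS:
            raise ValueError(f"unknown label: {self.label!r}")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError(f"confidence out of [0, 1]: {self.confidence}")


def parse_vote(raw: str) -> ExpertVote:
    """Parse one expert reply (a JSON object) into a validated vote."""
    obj = json.loads(raw)
    return ExpertVote(
        label=str(obj["label"]).lower(),
        confidence=float(obj["confidence"]),
        strategic_insight=obj["strategic_insight"],
        concerns=[Concern(**c) for c in obj.get("concerns", [])],
        reasoning=obj.get("reasoning", ""),
    )
```

Validating at parse time means a malformed expert reply fails loudly before it reaches the aggregator, instead of silently skewing the weighted score.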

#### Debate protocol\.

Given a fixed panel of $J$ experts, a single call to the panel runs two rounds. (i) Round 1 (independent, parallel). Each expert receives the task description and either the pipeline source (pre-execution) or the pipeline plus its execution results (post-execution), and emits an independent vote in the shared schema. All $J$ experts run concurrently. (ii) Round 2 (cross-visibility, parallel). Each expert is shown the round-1 votes of the other $J-1$ experts and may revise its own vote in the same schema. All $J$ experts again run concurrently.

The same two-round protocol is used in three modes: design recommendation (Phase 0, before any pipeline exists), pre-execution review (PreEval, after the Code agent emits $\pi$ and the Engineer has removed mechanical faults), and benchmark post-execution review (PostEval, after the benchmark returns per-guest $\hat{\Delta G}$).
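The two-round structure can be sketched with a thread pool, as below. This is an illustrative skeleton under stated assumptions: each expert is modeled as a plain callable `expert(context, peer_votes)` that returns a vote dictionary, with `peer_votes=None` in round 1; `run_debate` and `make_toy_expert` are hypothetical names, and a real expert would wrap an LLM call.

```python
from concurrent.futures import ThreadPoolExecutor


def run_debate(experts, context):
    """Run one two-round panel call.

    experts: dict mapping expert name -> callable(context, peer_votes) -> vote dict.
    Returns (round1_votes, round2_votes), both keyed by expert name.
    """
    names = list(experts)
    with ThreadPoolExecutor(max_workers=len(names)) as pool:
        # Round 1: every expert votes independently and concurrently.
        round1 = dict(zip(names, pool.map(lambda n: experts[n](context, None), names)))

        # Round 2: each expert sees the other J-1 round-1 votes and may revise.
        def revise(name):
            peers = {k: v for k, v in round1.items() if k != name}
            return experts[name](context, peers)

        round2 = dict(zip(names, pool.map(revise, names)))
    return round1, round2


def make_toy_expert(initial_label):
    """Toy stand-in for an LLM expert: it adopts 'fail' if any peer voted 'fail'."""
    def expert(context, peer_votes):
        if peer_votes is None:
            return {"label": initial_label, "confidence": 0.8}
        swayed = any(v["label"] == "fail" for v in peer_votes.values())
        return {"label": "fail" if swayed else initial_label, "confidence": 0.8}
    return expert
```

Because round 2 reads only the frozen round-1 votes, the order in which experts finish does not affect the result, which is what allows both rounds to run concurrently.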

#### Aggregator $\mathcal{A}_{\rho}$.

The aggregator collapses the round-2 votes into the single typed critique $c$ used by the Code agent. Each vote carries a label $\ell_i\in\{\textsc{Pass},\textsc{Uncertain},\textsc{Fail}\}$, a confidence $\kappa_i\in[0,1]$, and a free-text strategic insight. The aggregator scores each label as $s(\textsc{Pass})=1$, $s(\textsc{Uncertain})=0.5$, $s(\textsc{Fail})=0$, and computes the reputation- and confidence-weighted mean

$$S\;=\;\frac{\sum_{i}\rho_{i}\,\kappa_{i}\,s(\ell_{i})}{\sum_{i}\rho_{i}\,\kappa_{i}},\tag{6}$$

which is collapsed to a single panel label via two thresholds: $S\geq 0.7\Rightarrow\textsc{Pass}$, $S\leq 0.3\Rightarrow\textsc{Fail}$, otherwise Uncertain. The accompanying critique text is built by concatenating the per-expert strategic-insight and concern strings in reputation-weighted order; the full per-expert transcript is also retained for the Code agent's next-trial context.
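Eq. (6) and the two thresholds translate directly into a few lines of Python. The sketch below assumes votes are dictionaries with `expert`, `label`, and `confidence` keys (an illustrative layout, not the released data structure); the label scores and thresholds are the ones stated above.

```python
SCORE = {"pass": 1.0, "uncertain": 0.5, "fail": 0.0}


def aggregate(votes, reputations, pass_thr=0.7, fail_thr=0.3):
    """Reputation- and confidence-weighted panel score S and its panel label.

    votes: list of dicts with keys 'expert', 'label', 'confidence'.
    reputations: dict mapping expert name -> rho_i.
    """
    weights = [reputations[v["expert"]] * v["confidence"] for v in votes]
    total = sum(weights)
    if total == 0:
        raise ValueError("all weights are zero; S is undefined")
    s = sum(w * SCORE[v["label"]] for w, v in zip(weights, votes)) / total
    if s >= pass_thr:
        label = "pass"
    elif s <= fail_thr:
        label = "fail"
    else:
        label = "uncertain"
    return s, label


# Worked example with equal reputations (the uniform-prior mean of 0.5):
votes = [
    {"expert": "force_field", "label": "pass", "confidence": 0.9},
    {"expert": "sampling", "label": "uncertain", "confidence": 0.6},
    {"expert": "analysis", "label": "fail", "confidence": 0.8},
]
reputations = {"force_field": 0.5, "sampling": 0.5, "analysis": 0.5}
s, label = aggregate(votes, reputations)
# Weights are 0.45, 0.30, 0.40; S = (0.45 + 0.15 + 0) / 1.15, which is about 0.52,
# so the panel label is "uncertain".
```

With equal reputations the $\rho_i$ cancel and $S$ reduces to a confidence-weighted mean; reputations only change the outcome once the experts' posteriors diverge.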

### B.3 Reputation Update and Convergence

Each expert $i$ carries a Beta posterior over its reliability, defined as the probability that its pre-execution vote is consistent with the post-execution outcome. The prior is uniform, $\mathrm{Beta}(\alpha_{i}=1,\beta_{i}=1)$. After each completed trial, the orchestrator inspects each expert's $(c_{\text{pre},t},c_{\text{post},t})$ vote pair and applies a single Beta update per expert:

$$\begin{aligned}\alpha_{i}\,&\mathrel{+}=\,\mathbf{1}\!\left[\text{expert }i\text{ consistent}\right],\\ \beta_{i}\,&\mathrel{+}=\,\mathbf{1}\!\left[\text{expert }i\text{ inconsistent}\right],\end{aligned}\tag{7}$$

where "consistent" is the agreement between the pre-execution label of expert $i$ and the post-execution outcome of the trial (operationalized as the agreement between $i$'s pre-execution vote and the aggregator's post-execution label). The deterministic weight fed into the aggregator $\mathcal{A}_{\rho}$ is the posterior mean

$$\rho_{i}\;=\;\frac{\alpha_{i}}{\alpha_{i}+\beta_{i}},\tag{8}$$

which is the Bayes estimate of reliability under squared-error loss in the Beta–Bernoulli model; with the uniform prior it is shrunk toward $1/2$ relative to the maximum-likelihood estimate. A Thompson-sampling variant draws $\tilde{\rho}_{i}\sim\mathrm{Beta}(\alpha_{i},\beta_{i})$ per trial; the deterministic posterior-mean form is the default reported in the main results.
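The update of Eq. (7), the posterior mean of Eq. (8), and the Thompson-sampling variant fit in a small class. In this sketch the class name `Reputation` and the helper `is_consistent` are hypothetical, and treating consistency as exact label equality is an assumption; the text does not specify how an Uncertain vote is scored against the panel label.

```python
import random
from dataclasses import dataclass


@dataclass
class Reputation:
    alpha: float = 1.0   # uniform Beta(1, 1) prior
    beta: float = 1.0

    def update(self, consistent: bool) -> None:
        """Eq. (7): one Beta update per completed trial."""
        if consistent:
            self.alpha += 1.0
        else:
            self.beta += 1.0

    @property
    def mean(self) -> float:
        """Eq. (8): posterior mean used as the deterministic weight rho_i."""
        return self.alpha / (self.alpha + self.beta)

    def sample(self, rng: random.Random) -> float:
        """Thompson-sampling variant: draw rho_i from Beta(alpha, beta)."""
        return rng.betavariate(self.alpha, self.beta)


def is_consistent(pre_label: str, post_panel_label: str) -> bool:
    # Assumption: consistency means exact label equality.
    return pre_label == post_panel_label


rep = Reputation()
for pre, post in [("pass", "pass"), ("fail", "fail"), ("pass", "fail")]:
    rep.update(is_consistent(pre, post))
# Two consistent and one inconsistent trial give alpha = 3, beta = 2, mean = 0.6.
```

Starting from the uniform prior, an expert with no observations has weight 0.5, so early trials weight all experts equally up to their stated confidences.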

#### Convergence within a task\.

Under the Beta–Bernoulli update of Eq. ([7](https://arxiv.org/html/2606.12916#A2.E7)), $\rho_{i}$ converges almost surely to expert $i$'s true reliability $p_{i}$ as the number of consistent/inconsistent observations grows, with $\mathrm{Var}(\rho_{i})=\rho_{i}(1-\rho_{i})/(\alpha_{i}+\beta_{i}+1)=\mathcal{O}(1/n_{i})$. Within the $N=5$ successful-trial budget of a single task (a trial is counted only when the agent produces a runnable pipeline), $n_{i}\leq 5$ per expert, so the posterior mean has not converged to $p_{i}$; the update therefore functions as a soft prior that prevents the aggregator from equal-weighting an obviously miscalibrated expert with reliable ones. The empirical $\rho_{t}$ trajectories per host are written to the per-task reputation log released with the code.
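To make the "has not converged" statement concrete, consider a hypothetical expert that is consistent in 4 of the $n_i=5$ trials of a task (these counts are illustrative, not reported results). Starting from the uniform prior, the formulas above give:

```latex
\begin{aligned}
\alpha_i &= 1 + 4 = 5, \qquad \beta_i = 1 + 1 = 2, \\
\rho_i &= \frac{\alpha_i}{\alpha_i + \beta_i} = \frac{5}{7} \approx 0.714, \\
\mathrm{Var}(\rho_i) &= \frac{\rho_i (1 - \rho_i)}{\alpha_i + \beta_i + 1}
  = \frac{(5/7)(2/7)}{8} \approx 0.0255, \\
\sqrt{\mathrm{Var}(\rho_i)} &\approx 0.16 .
\end{aligned}
```

A posterior standard deviation of about 0.16 on a quantity bounded in $[0,1]$ is large, which is why the within-task reputation is better read as a soft reweighting than as a converged reliability estimate.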

## Appendix C Experimental Details

We provide the experimental details: LLM and MD configurations, hardware, trial protocol, and benchmark splits\. Exact configurations and prompt texts are released with the code\.

#### LLM configuration\.

We record, for each agent role in the MDForge loop (Code agent, Engineer agent, the $J=3$ panel experts, Layer-1 static verifier, and the aggregator $\mathcal{A}_{\rho}$), the model identifier, sampling temperature, maximum context, and retry policy. The same configuration is reused across all hosts; baselines are run on the same backbone with their own prompt templates. We use Claude Opus 4.7 as the default backbone.

#### Hardware and compute budget\.

Each guest evaluation runs in $\approx 2$ GPU-hours on a single A40 node. The verbal RL loop continues until $N=5$ successful trials accumulate per host (a trial is counted when the agent produces a runnable pipeline; attempts that abort at Layer-1, crash the Engineer, or are abandoned by the agent do not count toward $N$), so the total round count per host can exceed five.

#### Trial protocol\.

A trial begins with the Code agent emitting a pipeline and ends either when the pipeline aborts at a stage boundary or when it completes the analysis stage and returns a $\hat{\Delta G}$ per training guest. *Coding success* counts a trial whose emitted code is runnable, regardless of whether it later crashes or does not converge inside MD. The *best pipeline* reported in Table [1](https://arxiv.org/html/2606.12916#S4.T1) is selected, among trials that complete the production stage on all four training guests, as the one with the highest training-set Kendall $\tau$; the same selection rule is applied to all methods.
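The selection rule can be sketched as follows. The example uses a hand-written Kendall $\tau$-a (concordant minus discordant pairs over all pairs) so that it has no dependencies; which $\tau$ variant the released code uses is not stated, and the trial records, field names, and free-energy values below are hypothetical.

```python
from itertools import combinations


def kendall_tau_a(x, y):
    """Kendall tau-a: (concordant - discordant) / number of pairs.

    Tied pairs count as neither concordant nor discordant.
    """
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(list(zip(x, y)), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


def select_best(trials, experimental):
    """Pick the eligible trial with the highest training-set Kendall tau.

    Only trials that completed production on all training guests are eligible.
    """
    eligible = [t for t in trials if t["completed_all"]]
    if not eligible:
        return None
    return max(eligible, key=lambda t: kendall_tau_a(t["pred"], experimental))


# Hypothetical experimental and predicted binding free energies (kcal/mol).
experimental = [-14.1, -11.0, -8.2, -6.5]
trials = [
    {"name": "trial_1", "pred": [-9.0, -12.0, -8.0, -7.5], "completed_all": True},
    {"name": "trial_2", "pred": [-12.0, -11.5, -9.0, -7.0], "completed_all": True},
    {"name": "trial_3", "pred": [-14.0, -11.0, -8.0, -6.0], "completed_all": False},
]
best = select_best(trials, experimental)
```

Here `trial_3` ranks the guests perfectly but is excluded because it did not complete on all guests, so the rule picks `trial_2`. With only four training guests there are six pairs, so $\tau$ can take only a handful of distinct values and ties between trials are likely.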

#### Benchmark splits\.

For each of the three hosts (CB[7], OAH, CBClip), the guest set is split into four training guests (visible to the verbal RL feedback loop) and the remaining guests as a held-out test set. The training quadruple is chosen to span the experimental $\Delta G$ range of each host and is fixed across all methods to keep comparisons aligned. See Table [3](https://arxiv.org/html/2606.12916#A3.T3) for details.

Table 3: Benchmark train/test splits. Indices are zero-based within each host's guest list.

## Appendix D Agent Prompts

We present the system prompts that define each agent’s role inside MDForge\. Three domain experts \(force\-field, sampling, analysis\) act as peer co\-designers and reviewers, while a single code\-writing agent, split between the Pipeline Writer and the Pipeline Engineer, is responsible for producing and debugging the four\-stage Python pipeline\. Each panel below distills the operative content of the corresponding system prompt; the full prompts are released with the code\.

**Force-Field Expert: system prompt summary**

**Persona.** A senior computational chemist with 10+ years on small-molecule force-field development and on the parameterization of SAMPL3–9 host-guest benchmarks. The agent participates as a peer in pipeline design, not as a narrow gatekeeper.

**Primary domain.**

- *Guest force fields.* GAFF/GAFF2+AM1-BCC as SAMPL default, with known $\sim$1 kcal/mol over-binding bias on cation-$\pi$ contacts; OpenFF (Sage/Parsley) for more robust SMIRNOFF typing on non-standard cations; CGenFF only when the host is also CHARMM-parameterized. Aware of the canonical antechamber failure where tertiary ammonium nitrogen is mistyped as `nz` (sp2) instead of `n4`.
- *Host force fields.* CB7 is parameterized with the same family as the guest. Charge provenance of the SAMPL-distributed `cb7.mol2` is treated as suspect; regenerating with AM1-BCC for consistency is the most defensible option.
- *Water and ions.* TIP3P as the SAMPL baseline, with $\sim$0.5–1 kcal/mol bias toward less-negative $\Delta G$ around cationic ammoniums. OPC or TIP4P-Ew paired with matched ion sets (Joung-Cheatham for TIP3P/SPC; OPC-trained ions for OPC) for improved cation hydration. Never mix water-specific ions with the wrong water model.
- *File consistency.* Atom-type assignments must agree between mol2, frcmod and tleap; `parmchk2` must run after any atom-type rewriting; `tleap.log` should be scanned for “Could not find …” warnings.

**Co-design behaviour.** Before enumerating concerns, the agent commits to a strategic position: for the specific chemical class at hand, what force-field choice would it make and why. It then audits whether the Writer's choice is defensible by that standard. In post-eval, it reads anomalous energy components and the reported $\Delta G$ through the lens of class-specific systematic biases (e.g., GAFF2+AM1-BCC+TIP3P should land $\sim$1–2 kcal/mol less negative than the experimental reference for CB7-cation systems). Cross-domain remarks on sampling, restraints or analysis are welcome when they affect whether force-field concerns are even detectable.

**Operating modes.** The same agent is invoked in four modes: (A) design recommendation before any pipeline exists, (B) pre-eval critique of a proposed design, (C) post-eval interpretation of a single-molecule run, and (D) post-eval over a multi-molecule benchmark, where it proposes the specific, surgical pipeline edit (named file, region and change) that would most improve next-iteration MAE.

**Output contract.** Each invocation returns a single JSON object with fields `label` $\in\{$pass, fail, uncertain$\}$, `confidence`, a load-bearing `strategic_insight`, a list of `concerns` (each with severity and suggested focus), and a final `reasoning` synthesis. The `strategic_insight` is asked to be method-level rather than parameter-level, system-aware, comparative across alternatives, and literature-grounded.

**Sampling Expert: system prompt summary**

**Persona.** A senior molecular-simulation methodologist with 10+ years of experience designing alchemical, APR and umbrella-sampling protocols for binding free energies. Calibrated on hundreds of SAMPL-style benchmarks and able to recognize an under-sampled protocol from the $\lambda$ schedule alone.

**Primary domain.**

- *Strategy choice.* Two main families for absolute binding: Attach-Pull-Release (APR, Henriksen-Gilson; the SAMPL CB7 standard; typically 15–25 windows, 1–5 ns each) and alchemical absolute binding (less common on CB7, with known pitfalls around PME+decoupling in openmmtools, softcore LJ at $\alpha=0.5$, electrostatics-first $\lambda$ schedules, and dense LJ-endpoint spacing).
- *Integrator and timestep.* `LangevinMiddleIntegrator` as modern default; 2 fs with HBonds constraints, 4 fs with HBonds+HMR; 1/ps friction. 5 fs HMR is aggressive and requires validation.
- *Equilibration.* Minimize, then NVT heating (100–500 ps), then NPT density (0.5–2 ns; longer for charged guests). A pipeline that skips NPT is a red flag.
- *Replicates and seeds.* For CB7-class systems a single long replicate is often acceptable; each replicate must be seeded independently. $\geq 3$ replicates is preferred but rarely fits the budget.
- *Hardware and wall-clock.* OpenMM CUDA or `pmemd.cuda`, never silently CPU. The agent is explicitly briefed on a 2-hour cap on the production stage and is told to surface the trade-off (“24 windows $\times$ 2 ns is defensible but exceeds budget; reduce to 12 or accept the violation”) rather than demand the impossible.

**Co-design behaviour.** The `strategic_insight` field is required to answer three questions for the specific system class: (i) what sampling strategy is best practice (e.g., APR with Henriksen-Gilson corrections for CB7), (ii) is the Writer's choice defensible, and if it took the less-common alchemical route, why might that be reasonable, and (iii) what is the largest sampling-side risk to the reported $\Delta G$ under the chosen design and wall-clock budget. In post-eval, the agent reads wall-clock used vs. designed, integrator stability evidence, T/P/density drift, per-window dwell time, replicate scatter, autocorrelation, and MBAR overlap; cross-domain remarks on force-field or restraint choices are welcome when they affect sampling sufficiency.

**Operating modes & output contract.** As for the force-field expert, the same JSON schema is emitted in all four modes (design recommendation, pre-eval, single-molecule post-eval, multi-molecule benchmark post-eval). In Mode D the agent is asked to name a surgical pipeline edit (specific file, region, change) that should improve next-iteration MAE.

**Analysis Expert: system prompt summary**

**Persona.** A senior simulation methodologist whose specialty is statistical-mechanics estimators for free-energy calculations and the thermodynamic-cycle algebra around restraint application and release. Has implemented MBAR/BAR/TI from scratch, derived the Boresch standard-state correction from the Gaussian partition function, and spent years separating “the calculation didn't converge” from “the estimator was wrong” from “the restraint correction has the wrong sign”.

**Primary domain.**

- *Restraint design.* APR-style (1 distance along the host symmetry axis; analytic release; standard for CB7) versus Boresch (6-DOF, closed-form correction, overkill for symmetric hosts). Anchors are rigid heavy atoms (host ring carbon or carbonyl centroid; guest bridgehead or ammonium N). Sane CB7 force constants: $k_{r}=5\text{–}20$ kcal/mol/Å$^2$, Boresch angles/torsions 50–200 kcal/mol/rad$^2$.
- *Standard-state correction.* The agent is given the harmonic well integral $V_{\text{well}}=(2\pi kT/k)^{3/2}$ and $\Delta G_{\text{release}}=-kT\ln(V_{\text{std}}/V_{\text{well}})$ with $V_{\text{std}}=1660$ Å$^3$, and is explicitly warned that sign errors, leg misplacement, or a missing $-RT\ln(n_{\text{sym}})$ symmetry factor are the most common cause of $\Delta G_{\text{bind}}$ off by 5–15 kcal/mol.
- *Estimator choice.* MBAR by default for absolute binding via decoupling; BAR/TI when only adjacent pairs or $\langle\partial H/\partial\lambda\rangle$ are available; the Henriksen-Gilson three-term decomposition for APR; MM-PB/GBSA only as a rough first pass.
- *$\lambda$ schedule.* Electrostatics-first, soft-core LJ with $\alpha=0.5$, denser LJ spacing near the endpoint, 8–12 electrostatics windows plus 12–15 LJ windows, MBAR overlap $\geq 0.03$ between adjacent states.
- *Uncertainty.* Three layers: within-replicate (pymbar bootstrap/block-jackknife), across-replicate scatter, and systematic biases (force field, restraint correction, sampling); reporting $\pm 0.1$ kcal/mol on a CB7 system is treated as a smell.

**Co-design behaviour.** The `strategic_insight` field is asked to answer: which estimator and $\lambda$ schedule are right for the specific system, is the thermodynamic cycle algebra correctly implemented *in code* (term-by-term sign check), and what is the biggest analysis-side risk to the reported $\Delta G$. The agent is given a numerical reasonableness band for CB7-adamantylammonium (literature $\Delta G\approx-14$ kcal/mol; GAFF2+AM1-BCC+TIP3P should land in $[-14,-10]$; values outside $[-18,-8]$ have something wrong).

**Operating modes & output contract.** As for the other experts: four-mode invocation and the shared JSON schema with `label`, `confidence`, `strategic_insight`, `concerns`, and `reasoning`.

**Pipeline Writer: system prompt summary**

Pipeline Writer (code-authoring agent).

*Mandate.* Author a complete, runnable MD pipeline that computes the binding free energy of the host-guest pair specified by the user, by writing Python code from scratch as a sequence of $K=4$ sequential stages. Docking is out of scope: a bound-complex `complex.pdb` is pre-staged in the working directory and stage 01 reads it directly.

*Step 0: literature reconnaissance.* Before writing a single line of code, the agent is required to use `WebSearch` and `WebFetch` for at least three queries combining the host class with terms such as “binding free energy method”, “SAMPL benchmark”, “attach-pull-release” and “alchemical absolute binding”, and to fetch at least one methods paper. The `RATIONALE` paragraph must name the chosen method, cite at least one published reference for the choice on this host class, and explain why the method fits the system's physics. Methodological choices without a literature citation are not acceptable.

*Four-stage pipeline.*

1. `01_prep.py`: read `complex.pdb`, pick force field/charge model/water, parameterize the guest via `antechamber` + `parmchk2`, build the solvated topology in a single `tleap` call, then minimize.
2. `02_equilibrate.py`: short NVT heating + NPT density, reporting T/P/density traces and drift.
3. `03_production.py`: the expensive sampling stage, reading per-window sampling length from the `MDFORGE_PRODUCTION_NS_PER_WINDOW` environment variable.
4. `04_analysis.py`: MBAR/TI/WHAM/BAR with restraint standard-state correction and $-RT\ln(n_{\text{sym}})$ symmetry correction, populating `delta_g_kcal_per_mol` with an uncertainty.

*Molecule-agnostic invariant.* A single pipeline must run on every (host, guest) task with only the input files changing. The agent is forbidden from hardcoding guest names, atomic charges, symmetry numbers, or pH, and must read those from `task_metadata.json`; only the canonical filenames `guest.mol2`, `host.mol2`, `complex.pdb` may appear in code.

*Output contract.* The reply must follow an exact `RATIONALE`/`ENTRY`/`FILE` block structure that the harness parses mechanically; deviations cause an automatic Layer-1 failure. Each stage must write `stage_NN_result.json` on exit (even on failure) with `status`, `wall_time_seconds`, `delta_g_kcal_per_mol`, `convergence_flags`, `energy_components`, `diagnostics`, and a `writer_notes` field that the verifier experts read.

*Engineering pitfalls embedded in the prompt.* The Writer is briefed on environment-specific failure modes: openmmtools' PME-with-decoupled-electrostatics incompatibility (must set `annihilate_electrostatics=True`), antechamber's `nz` vs. `n4` mis-typing on protonated tertiary amines (with the explicit warning that AM1-BCC charges on ammonium N are physically negative, so naive “$N>0$” sanity checks must not be inserted), `tleap` sourcing order (small-molecule FF before water leaprc), CB7 host-charge provenance, the requirement that result files be written before re-raising on failure, hard wall-clock budgeting for stage 03, mandatory GPU use, and a single-GPU-per-molecule invariant (no `ProcessPoolExecutor` across GPUs inside one pipeline).

**Pipeline Engineer: system prompt summary**

Pipeline Engineer (debug-and-run agent).

*Mandate.* Given the four stage files the Writer just emitted, debug, iterate, and run the pipeline end-to-end in the sandbox until stage 04 outputs a finite $\Delta G$. The session ends with a non-null `delta_g_kcal_per_mol` or it is considered failed; the next trial inherits worse context if it fails.

*Execution discipline.* Stages are run sequentially with synchronous blocking `Bash` calls (`timeout` set generously). After each run the agent inspects `stage_NN_result.json` before proceeding. Stage 03 must not be re-run once it has succeeded: if stage 04 then crashes, only stage 04 is re-run against the existing per-window energies.

*Hard rules (allowed vs. forbidden edits).* The Engineer may make mechanical fixes (switching `antechamber -c bcc` $\to$ `-c rc` to skip a hang, removing a wrong sanity check, fixing `tleap` sourcing order, adding `os.makedirs(..., exist_ok=True)`, catching format exceptions). It is forbidden from (i) changing methodological choices (APR $\to$ DDM, GAFF2 $\to$ OPLS, TIP3P $\to$ OPC, reducing production length), (ii) gaming QC gates to manufacture success (silencing flags, zeroing uncertainty, lowering thresholds), (iii) introducing molecule-specific hardcoding, or (iv) bypassing `MDFORGE_PRODUCTION_NS_PER_WINDOW` to make $\Delta G$ look better in the debug pass.

*Forbidden tools.* The Engineer runs in a single, non-resumable session. `ScheduleWakeup`, `Monitor`, `ToolSearch`, and `run_in_background: True` are all explicitly disabled; the only available tools are `Read`, `Write`, `Edit`, `Bash`. Long-running stages are handled by blocking `Bash` with a large timeout rather than backgrounding.

*Exit conditions.* Success: `stage_04_result.json` carries a finite numeric $\Delta G$ and all earlier stages completed cleanly. Time budget: $\sim$60 minutes wall-clock for the whole debug session; the agent wraps up gracefully if running short. Stuck: if the same kind of fix has been tried twice on the same stage and the same error keeps coming back, the agent stops and documents the blocker for the next trial's Writer to handle.
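The Writer's result-file contract (write `stage_NN_result.json` on exit, even on failure, before re-raising) and the Engineer's success condition (a finite numeric $\Delta G$) can be sketched as below. The field names are those listed in the prompt summary; the wrapper `run_stage`, the checker `has_finite_delta_g`, and the status strings `"ok"` and `"failed"` are assumptions for illustration and are not taken from the released code.

```python
import json
import math
import time
from pathlib import Path


def run_stage(stage_no, body, workdir="."):
    """Run body() and always write stage_NN_result.json, even if body() raises."""
    result = {
        "status": "failed",            # hypothetical status values: "ok" / "failed"
        "wall_time_seconds": None,
        "delta_g_kcal_per_mol": None,
        "convergence_flags": {},
        "energy_components": {},
        "diagnostics": {},
        "writer_notes": "",
    }
    start = time.monotonic()
    try:
        result.update(body() or {})    # body returns a partial result dict
        result["status"] = "ok"
    except Exception as exc:
        result["diagnostics"]["error"] = repr(exc)
        raise                          # re-raise only after the finally block writes
    finally:
        result["wall_time_seconds"] = time.monotonic() - start
        path = Path(workdir) / f"stage_{stage_no:02d}_result.json"
        path.write_text(json.dumps(result, indent=2))
    return result


def has_finite_delta_g(path):
    """Engineer-side check: the result file carries a finite numeric Delta G."""
    value = json.loads(Path(path).read_text()).get("delta_g_kcal_per_mol")
    is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
    return is_number and math.isfinite(value)
```

Writing the file inside `finally` is what lets the Engineer inspect diagnostics after a crash: the exception still propagates, but the JSON record is already on disk when it does.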