ToolMol: Evolutionary Agentic Framework for Multi-objective Drug Discovery

arXiv cs.LG 05/14/26, 04:00 AM Papers
Summary
ToolMol is an evolutionary agentic framework that combines a multi-objective genetic algorithm with an LLM-based operator to design small-molecule drugs, achieving state-of-the-art binding affinity and drug-likeness on multiple protein targets.
arXiv:2605.12784v1 Announce Type: new Abstract: Advances in large language models (LLMs) have recently opened new and promising avenues for small-molecule drug discovery. Yet existing LLM-based approaches for molecular generation often suffer from high rates of invalid and low-quality ligand candidates, a result of the syntactic limitations of current models with regard to molecular strings. In this paper, we introduce $\texttt{ToolMol}$, an evolutionary agentic framework for de novo drug design. $\texttt{ToolMol}$ combines a multi-objective genetic algorithm with an agentic LLM operator that iteratively updates the ligand population. We build a comprehensive toolbox of RDKit-backed functions that allows our agentic operator to consisently make precise ligand modifications. $\texttt{ToolMol}$ achieves state-of-the-art performance on multi-objective property optimization tasks, discovering drug-like and synthesizable ligands that have $>10\%$ stronger predicted binding affinity compared to existing methods, evaluated on three protein targets. $\texttt{ToolMol}$ ligands additionally achieve state-of-the-art results in gold-standard Absolute Binding Free Energy scores, gaining over existing methods by over $35\%$. By studying chain-of-thought reasoning traces, we observe that tool-calling enables the model to more faithfully execute its planned modifications, efficiently exploiting the strong chemical prior knowledge in LLMs.
Original Article
View Cached Full Text
Cached at: 05/14/26, 06:18 AM
# ToolMol: Evolutionary Agentic Framework for Multi-objective Drug Discovery
Source: [https://arxiv.org/html/2605.12784](https://arxiv.org/html/2605.12784)
Sharvaree VadgamaSumanth VaramballyPeter EckmannMichael K\. GilsonRose Yu

###### Abstract

Advances in large language models \(LLMs\) have recently opened new and promising avenues for small\-molecule drug discovery\. Yet existing LLM\-based approaches for molecular generation often suffer from high rates of invalid and low\-quality ligand candidates, a result of the syntactic limitations of current models with regard to molecular strings\. In this paper, we introduceToolMol, an evolutionary agentic framework for de novo drug design\.ToolMolcombines a multi\-objective genetic algorithm with an agentic LLM operator that iteratively updates the ligand population\. We build a comprehensive toolbox of RDKit\-backed functions that allows our agentic operator to consisently make precise ligand modifications\.ToolMolachieves state\-of\-the\-art performance on multi\-objective property optimization tasks, discovering drug\-like and synthesizable ligands that have\>10%\>10\\%stronger predicted binding affinity compared to existing methods, evaluated on three protein targets\.ToolMolligands additionally achieve state\-of\-the\-art results in gold\-standard Absolute Binding Free Energy scores, gaining over existing methods by over35%35\\%\. By studying chain\-of\-thought reasoning traces, we observe that tool\-calling enables the model to more faithfully execute its planned modifications, efficiently exploiting the strong chemical prior knowledge in LLMs\.

Machine Learning, ICML

## 1Introduction

Small molecule drug discovery is a resource\-intensive process that requires generated compounds to satisfy many crucial properties, historically requiring many rounds of wet lab trial\-and\-error\. Advances in machine learning have yielded many generative methods that aim to solve this problem\. Most previous work has focused on specialized generative models such as VAEs\(Eckmann et al\.,[2022](https://arxiv.org/html/2605.12784#bib.bib8),[2025](https://arxiv.org/html/2605.12784#bib.bib9); Noh et al\.,[2022](https://arxiv.org/html/2605.12784#bib.bib30)\), diffusion models\(Lee et al\.,[2023](https://arxiv.org/html/2605.12784#bib.bib23); Zhou et al\.,[2024](https://arxiv.org/html/2605.12784#bib.bib45); Joshi et al\.,[2025](https://arxiv.org/html/2605.12784#bib.bib21); Guan et al\.,[2024](https://arxiv.org/html/2605.12784#bib.bib14); Dorna et al\.,[2024](https://arxiv.org/html/2605.12784#bib.bib6)\), and group equivariant diffusion models\(Hoogeboom et al\.,[2022](https://arxiv.org/html/2605.12784#bib.bib18); Vadgama et al\.,[2026](https://arxiv.org/html/2605.12784#bib.bib41); Liu et al\.,[2025](https://arxiv.org/html/2605.12784#bib.bib25)\)\. Optimization frameworks such as evolutionary algorithms\(Jensen,[2019](https://arxiv.org/html/2605.12784#bib.bib19)\)and Bayesian optimization algorithms\(Zhu et al\.,[2023](https://arxiv.org/html/2605.12784#bib.bib46)\)have also been applied to this problem\. However, existing generative methods often do not target important molecular properties, limiting their ability to generate ligands that simultaneously achieve desirable binding affinity, drug\-likeness and synthesizability\(Crucitti et al\.,[2024](https://arxiv.org/html/2605.12784#bib.bib5)\)\.

Large Language Models \(LLMs\) have recently begun to garner interest as a method to generate small molecules, showing promise in generating strong, drug\-like ligands\(Wang et al\.,[2025](https://arxiv.org/html/2605.12784#bib.bib42)\)\. Unlike specialized generative models, LLMs benefit from large\-scale pretraining on domain\-relevant, scientific text, which gives them the distinct advantage of inherent familiarity with the optimization task, as well as with the practices and heuristics of chemical research \(e\.g\. common reactions and lead optimization techniques\)\(White,[2023](https://arxiv.org/html/2605.12784#bib.bib43)\)\. Extensive LLM\-related works have demonstrated the ability of current LLMs to predict molecular properties\(Guo et al\.,[2023](https://arxiv.org/html/2605.12784#bib.bib15)\)and generate novel structures\(Flam\-Shepherd & Aspuru\-Guzik,[2023](https://arxiv.org/html/2605.12784#bib.bib12)\)\. Most recently, MOLLEO\(Wang et al\.,[2025](https://arxiv.org/html/2605.12784#bib.bib42)\)proposed a genetic algorithm that directly incorporates LLMs as a mutation and crossover operator to generate molecular offspring, outperforming many specialized generative models in generating ligands with desirable properties\.

However, a significant drawback with current LLMs is that they often fail at generating syntactically valid molecular strings, even when prompted to inspect their outputs carefully\. We observe that this failure occurs consistently, appearing in more than30%30\\%of attempted molecule generations on average, even on strong reasoning models like GPT\-OSS\-120B\(OpenAI et al\.,[2025](https://arxiv.org/html/2605.12784#bib.bib32)\)\. This significantly hinders the progress of current LLM\-based methods\. For instance, MOLLEO falls back on weaker non\-LLM based, deterministic crossover/mutation operators when the LLM operator fails to generate a valid SMILES result\. We claim that this method of allowing an LLM to directly output molecular strings is an imperfect method of utilizing LLMs for efficient, property\-based drug discovery\.

In this work, we introduceToolMol, a novel drug discovery algorithm focused on de novo small molecule drug design\. ToolMol combines a multi\-objective genetic algorithm with an agentic LLM operator that iterates upon the ligand population using a set of deterministic, RDKit\-backed tools\. ToolMol solves the problem of invalid molecular generations by exploiting the highly\-optimized LLM tool\-calling functionality prevalent in current models\. Instead of allowing the LLM to modify the molecular string encoding directly, ToolMol abstracts this process by providing the LLM with tools that allow it to simply provide structural parameters for its desired modifications\. This not only greatly decreases the number of invalid SMILES strings generated by the LLM, but also yields a significant improvement in the molecular properties of generated ligands, including predicted binding affinity, drug\-likeness, and synthesizability\.

We summarize the contributions of this work as below:

- •We presentToolMol, an evolutionary agentic drug discovery framework that combines a multi\-objective genetic algorithm with an agentic LLM operator to consistently generate syntactically valid and property\-optimizing molecules\.
- •We achieve state\-of\-the\-art results in multi\-objective property optimization across three protein targets, with predicted binding affinity gains exceeding10%10\\%over prior methods, as well as state\-of\-the\-art Absolute Binding Free Energy scores for two studied targets, demonstrating the practical utility of LLMs for de novo drug design\.
- •We study the effectiveness of our framework through case studies, and observe that the agentic tool\-calling process significantly improves concordance between the LLM’s reasoning trace and the actual ligand modifications\.

## 2Related Work

#### Generative models for molecular design

A variety of generative architectures have been developed for molecular design, each learning an implicit distribution over chemical space and leveraging an external oracle to guide generation toward molecules with desirable binding properties\. VAE\-based approaches such as\(Eckmann et al\.,[2022](https://arxiv.org/html/2605.12784#bib.bib8); Jin et al\.,[2018](https://arxiv.org/html/2605.12784#bib.bib20); Eckmann et al\.,[2025](https://arxiv.org/html/2605.12784#bib.bib9); Gómez\-Bombarelli et al\.,[2018](https://arxiv.org/html/2605.12784#bib.bib13)\)have shown promise, but are generally unaware of the 3D protein structure\. To address this, DecompOpt\(Zhou et al\.,[2024](https://arxiv.org/html/2605.12784#bib.bib45)\)and DecompDiff\(Guan et al\.,[2024](https://arxiv.org/html/2605.12784#bib.bib14)\)are diffusion models that condition on the protein structure, and are further guided toward optimal ligand molecules by an oracle and some external optimization algorithm\. Pocket2Mol\(Peng et al\.,[2025](https://arxiv.org/html/2605.12784#bib.bib35)\)employs a graph neural network, composed of several encoder and predictor modules, that auto\-regressively predicts the location and type of each subsequent ligand atom based on existing ligand atoms and the protein pocket\. PAFlow\(Zhou et al\.,[2025](https://arxiv.org/html/2605.12784#bib.bib44)\)employs a conditional flow\-matching algorithm guided by a learnable number\-of\-atoms predictor model to generate molecules that better match the size of the binding pocket\.

#### Multi\-objective frameworks

A critical challenge in drug design is multi\-objective optimization, as viable drug candidates must satisfy several property criteria simultaneously\. TAGMol\(Dorna et al\.,[2024](https://arxiv.org/html/2605.12784#bib.bib6)\)and DrugDiff\(Oestreich et al\.,[2025](https://arxiv.org/html/2605.12784#bib.bib31)\)use supplementary guide models to influence the Langevin dynamics during sampling, resulting in generated molecules that satisfy multiple property criteria\. Graph\-GA\(Jensen,[2019](https://arxiv.org/html/2605.12784#bib.bib19)\)employs an evolutionary algorithm that keeps track of an active population of molecules, applying deterministic crossover and mutation rules to progressively optimize multiple desired properties\. OMTRA\(Dunn et al\.,[2025](https://arxiv.org/html/2605.12784#bib.bib7)\)presents a multi\-modal, flexible flow matching model for structure\-based drug design\. HN\-GFN\(Zhu et al\.,[2023](https://arxiv.org/html/2605.12784#bib.bib46)\)utilizes a multi\-objective Bayesian optimization algorithm combined with a GFlowNet to optimize for several properties, including molecular diversity\.

#### LLMs for molecular generation

The use of LLMs in drug discovery is currently limited\. Current approaches address general\-purpose chemistry tasks\(Bran et al\.,[2023](https://arxiv.org/html/2605.12784#bib.bib3); Boiko et al\.,[2023](https://arxiv.org/html/2605.12784#bib.bib2); Ma et al\.,[2024](https://arxiv.org/html/2605.12784#bib.bib28); Choi et al\.,[2026](https://arxiv.org/html/2605.12784#bib.bib4)\), or fine\-tune LLMs to design strong binders in one shot\(Sheikholeslami et al\.,[2025](https://arxiv.org/html/2605.12784#bib.bib37)\)\. Genetic algorithms that utilize LLMs for crossover/mutation operations are better suited to the task of property optimization because they are able to incorporate feedback from oracles\. MOLLEO\(Wang et al\.,[2025](https://arxiv.org/html/2605.12784#bib.bib42)\), which represents the state\-of\-the\-art in LLM\-guided drug generation, augments Graph\-GA\(Jensen,[2019](https://arxiv.org/html/2605.12784#bib.bib19)\)by replacing algorithmic crossovers and mutations with LLM\-driven structural modifications, and achieves strong results across three protein targets\. The authors demonstrate the potential of these inherently chemistry\-aware LLMs to be a competitive generative method in drug discovery\. A relevant prior work on tool\-calling in biochemistry is El Agente Estructural\(Choi et al\.,[2026](https://arxiv.org/html/2605.12784#bib.bib4)\), a multimodal framework that equips an LLM with tools for visual and geometric inspection of molecules in 3D space\. However, it is designed for human\-in\-the\-loop interaction rather than integration into an optimization loop\.

In this work, we design a minimal yet effective toolbox that improves the quality of LLM\-suggested crossover and mutation operations within a genetic algorithm\. To our knowledge, this is the first work to utilize LLM tool\-calling within an evolutionary framework to optimize molecular properties for drug discovery\.

![Refer to caption](https://arxiv.org/html/2605.12784v1/figures/Figure1_real_molecules.png)Figure 1:Overview of ToolMol\.\(a\) We sample an initial ligand population from ZINC 250K\. \(b\) Parent ligands are sampled for crossovers & mutations with probability proportional to their fitness\. \(c\) An agent with access to a set of modification tools generates new ligands using structures from the selected parents\. \(d\) New offspring are evaluated by an oracle for all relevant objectives\. \(e\) A new population is formed from the non\-dominated Pareto frontier of the current population\. Steps b→\\rightarrowe are repeated until an oracle budget is reached\.

## 3ToolMol

We introduce ToolMol, an agentic, multi\-objective genetic algorithm framework that utilizes a tool\-calling LLM to make precise and guided modifications on the ligand population\. In this section, we first introduce our optimization problem, then describe the genetic algorithm followed by the tool\-calling process that comprise ToolMol\.

#### Problem Statement

We can broadly represent our molecular optimization problem as

𝐌∗=argmaxm∈MΦ\(m\)\\mathbf\{M\}^\{\*\}=\\operatorname\*\{argmax\}\_\{m\\in M\}\\Phi\(m\)wheremmis any valid molecule \(ligand\) andMMis the entire valid chemical space\.Φ\\Phiis an evaluation function that yields the fitness ofmmfornnobjectives\. This function can be defined in several ways; one naive approach is to treat the ”fitness” of a candidate \(the viability or utility of the member\) as a weighted sum of all objectives, i\.e\.Φ\(m\)=∑iwi∗ϕi\(m\)\\Phi\(m\)=\\sum\_\{i\}\\mathrm\{w\}\_\{i\}\*\\phi\_\{i\}\(m\), whereϕi\\phi\_\{i\}is theiith objective evaluator andwi\\mathrm\{w\}\_\{i\}is the weight given to that objective\.111Note that if any objectiveiiis minimizing instead of maximizing, we take the negation of the objective evaluator to beϕi\\phi\_\{i\}These weights are arbitrary and can be difficult to choose in practice\. Here, we avoid the need to make such arbitrary choices via partial ordering of molecules and the Pareto frontier\. Formally, we can compare two molecules by notatingm′≻mm^\{\\prime\}\\succ m, meaning thatm′m^\{\\prime\}strictly dominatesmmif and only if∀i:ϕi\(m′\)\>ϕi\(m\)\\forall i:\\phi\_\{i\}\(m^\{\\prime\}\)\>\\phi\_\{i\}\(m\)\. We can then define the optimal setM∗M^\{\*\}to be the non\-dominated Pareto frontier, given byM∗=\{m∈S:∄m′,m′≻m\}M^\{\*\}=\\\{m\\in S:\\nexists m^\{\\prime\},m^\{\\prime\}\\succ m\\\}, or the set of all molecules that are not dominated by any other molecule\. In other words,M∗M^\{\*\}is constructed by all ligands for which no other ligand strictly exceeds it in every objective\. Because the full chemical spaceMMis far too large to ever be sufficiently explored, we search in a limited subspace ofMM\(roughly determined in practice by random initial seeding\), and aim to find the non\-dominated Pareto frontierM∗M^\{\*\}within this subspace\.

In this work, we consider 3 objectives: the binding affinity \(ΔG\\Delta G, in kcal/mol\) ofmmto a particular protein binding target, as estimated with Boltz\-2\(Passaro et al\.,[2025](https://arxiv.org/html/2605.12784#bib.bib34)\), the quantitative estimate of drug\-likeness\(Bickerton et al\.,[2012](https://arxiv.org/html/2605.12784#bib.bib1), QED\), and the estimated synthetic accessibility\(Ertl & Schuffenhauer,[2009](https://arxiv.org/html/2605.12784#bib.bib10), SA\)\. All 3 of these objectives are standard targets in generative modeling, with QED and SA ensuring some level of ligand\-structure soundness and estimated binding affinity measuring the practical utility of the molecule as a binder of the targeted protein pocket\.

### 3\.1Multi\-objective Genetic Algorithm

The underlying framework for the ToolMol algorithm is a multi\-objective genetic algorithm \(MOGA\)\. The pseudocode for this process is detailed in Algorithm[1](https://arxiv.org/html/2605.12784#alg1)\(see Figure[1](https://arxiv.org/html/2605.12784#S2.F1)for a visual guide\)\. We begin with an initial populationℳ0\\mathcal\{M\}\_\{0\}randomly sampled from the ZINC 250K\(Sterling & Irwin,[2015](https://arxiv.org/html/2605.12784#bib.bib39)\)dataset, which provides a good starting base of drug\-like & synthesizable structures\. We define an ”oracle budget”BB, which determines how many molecules we evaluate before we terminate the algorithm\. We determine the stopping point in this way because predicting binding affinity is generally very computationally expensive; we set a hard limit on the number of total evaluations regardless of the population or offspring size\.

Input:Initial population

ℳ0\\mathcal\{M\}\_\{0\}, offspring size

nn, oracle budget

B=1000B=1000
Output:All molecule generations

ℳout\\mathcal\{M\}\_\{\\text\{out\}\}
ℳc←ℳ0\\mathcal\{M\}\_\{c\}\\leftarrow\\mathcal\{M\}\_\{0\}

//Current population

ℳout←ℳ0\\mathcal\{M\}\_\{\\text\{out\}\}\\leftarrow\\mathcal\{M\}\_\{0\}

//All molecules

while*oracle budget<B<B*do

offspring←\[\]\\text\{offspring\}\\leftarrow\[\\,\]
for*i←1i\\leftarrow 1tonn*do

Sample

m0,m1∼ℳcm\_\{0\},\\,m\_\{1\}\\sim\\mathcal\{M\}\_\{c\}with probability

∝kΦ\(m\)\\propto k^\{\\Phi\(m\)\}for const

kk
offspring\.append\([AgentGen](https://arxiv.org/html/2605.12784#S3.SS2)\(m0,m1\)\)\\text\{offspring\}\.\\text\{append\}\\bigl\(\{\\color\[rgb\]\{1,\.5,0\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{1,\.5,0\}\\hyperref@@ii\[agentgen\]\{\\textsc\{AgentGen\}\}\}\(m\_\{0\},\\,m\_\{1\}\)\\bigr\)
end for

ℳout←ℳout∪offspring\\mathcal\{M\}\_\{\\text\{out\}\}\\leftarrow\\mathcal\{M\}\_\{\\text\{out\}\}\\cup\\text\{offspring\}
ℳc←ℳc∪offspring\\mathcal\{M\}\_\{c\}\\leftarrow\\mathcal\{M\}\_\{c\}\\cup\\text\{offspring\}
for*m∈ℳcm\\in\\mathcal\{M\}\_\{c\}*do

Compute

ϕi\(m\)\\phi\_\{i\}\(m\)for each objective

ii
end for

ℳc←ParetoFrontier\(ℳc\)\\mathcal\{M\}\_\{c\}\\leftarrow\\textsc\{ParetoFrontier\}\(\\mathcal\{M\}\_\{c\}\)
end while

return

ℳout\\mathcal\{M\}\_\{\\text\{out\}\}

Algorithm 1ToolMol Genetic AlgorithmWe sample parent moleculesm0,m1m\_\{0\},m\_\{1\}from the current population with probabilities proportional tokΦ\(m\)k^\{\\Phi\(m\)\}for some constantkk\.Φ\(m\)\\Phi\(m\)is the scalar fitness ofmm, given byΦ\(m\)=∑ifi\(m\)\\Phi\(m\)=\\sum\_\{i\}f\_\{i\}\(m\), wherefi\(m\)f\_\{i\}\(m\)is theiith objective scaled to\[0,1\]\[0,1\]\.222For the unbounded binding affinity metric, we scale by setting a lower bound of 0 kcal/mol and a safe upper bound of−13\-13kcal/mol, a value that we have never observed our affinity predictor exceed\.We also explore an alternative sampling method based on Pareto ordering, the results of which may be found in Appendix[C](https://arxiv.org/html/2605.12784#A3)\. We pass the sampled parent molecules into the agentic LLM operator,AgentGen, which is described in the[next section](https://arxiv.org/html/2605.12784#S3.SS2)\. Specifically, for every pair of sampled parents,AgentGencreates one new candidate, which is added to the current set of offspring\. Afternnnew candidates, we merge the current population with the new offspring\. We then evaluate all resulting molecules for each objective and form the next generation by taking the non\-dominated Pareto frontier of the current population\. We continue the evolution until we exhaust our oracle budget\.

### 3\.2AgentGen: Tool\-calling LLM

Next, we describe theAgentGenfunction, which represents the tool\-calling process that the agentic operator takes to generate new molecules\. This process is inspired by the crossover & mutation operations carried out by classical genetic algorithms, and leverages tool\-calling to allow an LLM to act as the sole GA operator\.

Given two input moleculesm1,m2m\_\{1\},m\_\{2\}, we first format an initial prompt based on our desired objectives\. In order to give the LLM sufficient context to execute accurate tool\-calls, we append detailed structure information on both input molecules, given by RDKit\. This consists of an identification of every atom in each molecule, its RDKit index, number of substitutable hydrogens, neighboring atoms, centrality within the ligand, and more\. This information is necessary for the LLM to provide correct parameters for its tool calls, and aids it in making more informed decisions by providing structural context about the input molecules\. Full details about the input and intermediate prompt \(given byPromptFormat\) can be found in Appendix[B\.2](https://arxiv.org/html/2605.12784#A2.SS2)\.

We then begin the tool\-calling iteration process, which proceeds for at mostmax\_stepsiterations \(we usemax\_steps=10\\textit\{max\\\_steps\}=10in our experiments, although we note that we rarely ever meet or exceed this threshold\)\. At any given step, the LLM has access to a toolbox of 7 RDKit\-backed functions that aid with structural modifications\. For all functions, the LLM is responsible for providing all parameters specified in the function definition\. If any specification results in an invalid operation \(e\.g\. no available valence\), the tool returns a failed state and specifies the particular error\.

![[Uncaptioned image]](https://arxiv.org/html/2605.12784v1/figures/Toolbox.png)

The LLM\-callable functions are listed in the graphic above\. For complete details on each tool and its function parameters, see Appendix[B\.1](https://arxiv.org/html/2605.12784#A2.SS1)\.

The LLM is encouraged to callcrossover\_moleculeson the first step, when initially passed two input molecules from the parent population\. On subsequent steps, the LLM is encouraged to use other tools on the molecule resulting from the crossover operation\. This simulates how a standard genetic algorithm typically performs a crossover on two parent candidates, then an optional mutation on the resulting offspring\(Jensen,[2019](https://arxiv.org/html/2605.12784#bib.bib19)\)\.

![Refer to caption](https://arxiv.org/html/2605.12784v1/figures/Figure2.png)Figure 2:An example tool\-calling process\.The agent first decides to perform a crossover on the input molecules, utilizingcrossover\_molecules\. Then it decides to attach a methoxy group to the benzene structure, utilizingadd\_functional\_group\. At this point, it decides that the modifications are sufficient, and the new molecule is added to the offspring population\.A single tool call either succeeds and returns the new modified molecule, or fails and returns a message detailing the reason for the error\. In both cases, information about the executed tool call and structural details about the new molecule are added to the conversation history for the next tool\-calling iteration\. This process repeats until we either hit themax\_stepsiteration budget or until the LLM decides that it has made sufficient modifications\. Because all modifications are made in the deterministic graph space defined by the RDKitMolobject, the final molecule returned by this process is guaranteed to be a valid molecule with a valid SMILES encoding\. The only exception is if the LLM fails to call any function correctly formax\_stepsiterations, which we observe to be extremely uncommon\. A short example of this process is demonstrated in Figure[2](https://arxiv.org/html/2605.12784#S3.F2)\.

## 4Experiments

### 4\.1Experimental Setup

We evaluate the effectiveness of ToolMol on the multi\-objective task of optimizing for protein\-ligand binding affinity while preserving drug\-likeness and synthetic accessibility\.

#### Targets\.

In this work, we focus on three functionally & structurally unique protein\-binding targets:

1. 1\.c\-MET\(MET\_HUMAN\): Hepatocyte growth factor receptor
2. 2\.BRD4\(BRD4\_HUMAN\): Bromodomain\-containing protein 4
3. 3\.ACAA1\(THIK\_HUMAN\): 3\-ketoacyl\-CoA thiolase, peroxisomal

The targets c\-MET and BRD4 have significant medicinal chemistry literature\(Hong et al\.,[2016](https://arxiv.org/html/2605.12784#bib.bib17); Organ & Tsao,[2011](https://arxiv.org/html/2605.12784#bib.bib33)\), while ACAA1 has not been significantly explored as a drug target\. In particular, ACAA1 has no associated experimental binding\-affinity measurements in BindingDB\(Liu et al\.,[2007](https://arxiv.org/html/2605.12784#bib.bib26)\), a database of experimentally measured interactions between drug\-target proteins and ligands\. Thus, the results for ACAA1 report on the performance of the LLM on a target that has not appeared frequently within its pretraining dataset\.

#### Pipeline\.

For ToolMol, we seed 60 small molecules from ZINC 250K\(Sterling & Irwin,[2015](https://arxiv.org/html/2605.12784#bib.bib39)\)to comprise our initial population, with an offspring size of 35\. We utilize an exponential constantk=10k=10for the parent sampling step\. For all LLM\-based components, including baselines, we use GPT\-OSS\-120B\(OpenAI et al\.,[2025](https://arxiv.org/html/2605.12784#bib.bib32)\)\. We estimate ligand\-protein binding affinities with the recent biomolecular foundation model Boltz\-2\(Passaro et al\.,[2025](https://arxiv.org/html/2605.12784#bib.bib34)\)\. This decision is primarily motivated by the high accuracy Boltz\-2 demonstrates in favoring molecules that score highly on gold\-standard Absolute Binding Free Energy\(Feng et al\.,[2022](https://arxiv.org/html/2605.12784#bib.bib11), ABFE\)metrics\. We provide a brief correlation analysis between predicted Boltz\-2, ABFE, and AutoDock\(Trott & Olson,[2009](https://arxiv.org/html/2605.12784#bib.bib40)\)scores in Appendix[E\.2](https://arxiv.org/html/2605.12784#A5.SS2)to further justify this decision\.

#### Baselines\.

We evaluate ToolMol against the following methods\.

1. 1\.Pocket2Mol\(Peng et al\.,[2025](https://arxiv.org/html/2605.12784#bib.bib35)\)E\(3\)\-equivariant autoregressive model that generates 3D molecules conditioned on a protein pocket via diffusion\.
2. 2\.TAGMol\(Dorna et al\.,[2024](https://arxiv.org/html/2605.12784#bib.bib6)\)3D structure\-based framework that decouples diffusion sampling from gradient\-based property guidance, using predicted binding affinity, QED, and SA to steer generation\.
3. 3\.PAFlow\(Zhou et al\.,[2025](https://arxiv.org/html/2605.12784#bib.bib44)\)Conditional flow matching method that leverages a protein\-ligand interaction predictor to guide generation toward high\-affinity, drug\-like molecules\.
4. 4\.Graph\-GA\(Jensen,[2019](https://arxiv.org/html/2605.12784#bib.bib19)\)Genetic algorithm operating via predefined crossover and mutation rules on molecular graphs\.
5. 5\.ShinkaEvolve\(Lange et al\.,[2025](https://arxiv.org/html/2605.12784#bib.bib22)\): We adapt the hybrid MAP\-Elites\(Mouret & Clune,[2015](https://arxiv.org/html/2605.12784#bib.bib29)\)and islands algorithm from ShinkaEvolve, an LLM\-based evolutionary approach which achieved fantastic results in algorithm design\. We adapt this method for drug design, and test variants with both MOLLEO\-style LLM mutations and the ToolMol toolbox \(details of our implementation are in Appendix[D](https://arxiv.org/html/2605.12784#A4)\)\.
6. 6\.MOLLEO\(Wang et al\.,[2025](https://arxiv.org/html/2605.12784#bib.bib42)\)MOLLEO extends Graph\-GA by replacing predefined genetic operators with an LLM that directly performs crossovers and mutations on molecular candidates, optimizing jointly over binding affinity, QED, and SA\.

We note that Pocket2Mol, TAGMol, and PAFlow are designed for oracle\-free inference sampling, while all other baseline methods explicitly use affinity feedback as guidance during search\. We also note that we use Boltz\-2 for affinity prediction for all relevant baselines\.

Table 1:Results of ToolMol compared to generative modeling and LLM\-based baselines across three protein targets: c\-MET, BRD4, and ACAA1\. N/A indicates zero observations across all seeded runs\. Best results arebolded, second best areunderlinedMetricPocket2MolTAGMolPAFlowGraph\-GAShinkaEvolve\(No Tools\)ShinkaEvolve\(Tools\)MOLLEOToolMol\(ours\)c\-MET\(MET\_HUMAN\)BA \(↓\\downarrow\)−11\.27±0\.29¯\\underline\{\-11\.27\\pm 0\.29\}−11\.45±0\.11\\mathbf\{\-11\.45\\pm 0\.11\}−10\.63±0\.02\-10\.63\\pm 0\.02−10\.21±0\.26\-10\.21\\pm 0\.26−10\.17±0\.28\-10\.17\\pm 0\.28−11\.08±0\.07\-11\.08\\pm 0\.07−10\.15±0\.19\-10\.15\\pm 0\.19−11\.00±0\.09\-11\.00\\pm 0\.09FA \(↓\\downarrow\)−9\.39±0\.27\-9\.39\\pm 0\.27−7\.39±0\.43\-7\.39\\pm 0\.43−7\.58±0\.44\-7\.58\\pm 0\.44−9\.19±0\.29\-9\.19\\pm 0\.29−9\.72±0\.14¯\\underline\{\-9\.72\\pm 0\.14\}−9\.72±0\.35¯\\underline\{\-9\.72\\pm 0\.35\}−9\.62±0\.11\-9\.62\\pm 0\.11−10\.35±0\.17\\mathbf\{\-10\.35\\pm 0\.17\}HV \(↑\\uparrow\)0\.58±0\.0070\.58\\pm 0\.0070\.56±0\.0050\.56\\pm 0\.0050\.50±0\.00060\.50\\pm 0\.00060\.57±0\.010\.57\\pm 0\.010\.58±0\.010\.58\\pm 0\.010\.59±0\.0080\.59\\pm 0\.0080\.60±0\.01¯\\underline\{0\.60\\pm 0\.01\}0\.62±0\.01\\mathbf\{0\.62\\pm 0\.01\}BRD4\(BRD4\_HUMAN\)BA \(↓\\downarrow\)−10\.02±0\.24\-10\.02\\pm 0\.24−9\.54±0\.10\-9\.54\\pm 0\.10−8\.33±0\.09\-8\.33\\pm 0\.09−9\.79±0\.26\-9\.79\\pm 0\.26−9\.59±0\.09\-9\.59\\pm 0\.09−10\.80±0\.19\\mathbf\{\-10\.80\\pm 0\.19\}−9\.87±0\.23\-9\.87\\pm 0\.23−10\.64±0\.28¯\\underline\{\-10\.64\\pm 0\.28\}FA \(↓\\downarrow\)−8\.61±0\.17\-8\.61\\pm 0\.17−8\.06±0\.08\-8\.06\\pm 0\.08−7\.54±0\.12\-7\.54\\pm 0\.12−9\.07±0\.31\-9\.07\\pm 0\.31−9\.38±0\.03\-9\.38\\pm 0\.03−9\.20±0\.07\-9\.20\\pm 0\.07−9\.48±0\.19¯\\underline\{\-9\.48\\pm 0\.19\}−9\.91±0\.18\\mathbf\{\-9\.91\\pm 0\.18\}HV \(↑\\uparrow\)0\.52±0\.020\.52\\pm 0\.020\.53±0\.0080\.53\\pm 0\.0080\.43±0\.010\.43\\pm 0\.010\.56±0\.020\.56\\pm 0\.020\.56±0\.0080\.56\\pm 0\.0080\.57±0\.0010\.57\\pm 0\.0010\.59±0\.01¯\\underline\{0\.59\\pm 0\.01\}0\.60±0\.01\\mathbf\{0\.60\\pm 0\.01\}ACAA1\(THIK\_HUMAN\)BA \(↓\\downarrow\)−8\.45±0\.53\-8\.45\\pm 0\.53−8\.58±0\.06\-8\.58\\pm 0\.06−7\.90±0\.02\-7\.90\\pm 0\.02−8\.81±0\.13\-8\.81\\pm 0\.13−8\.78±0\.31\-8\.78\\pm 0\.31−10\.20±0\.11\\mathbf\{\-10\.20\\pm 0\.11\}−8\.41±0\.41\-8\.41\\pm 0\.41−9\.70±0\.23¯\\underline\{\-9\.70\\pm 0\.23\}FA \(↓\\downarrow\)−7\.39±0\.37\-7\.39\\pm 0\.37−6\.67±0\.09\-6\.67\\pm 0\.09N/A−8\.05±0\.16\-8\.05\\pm 0\.16−8\.51±0\.19¯\\underline\{\-8\.51\\pm 0\.19\}−8\.11±0\.08\-8\.11\\pm 0\.08−8\.12±0\.45\-8\.12\\pm 0\.45−8\.78±0\.15\\mathbf\{\-8\.78\\pm 0\.15\}HV \(↑\\uparrow\)0\.48±0\.090\.48\\pm 0\.090\.46±0\.0040\.46\\pm 0\.0040\.33±0\.020\.33\\pm 0\.020\.50±0\.010\.50\\pm 0\.010\.51±0\.0050\.51\\pm 0\.0050\.53±0\.01¯\\underline\{0\.53\\pm 0\.01\}0\.51±0\.020\.51\\pm 0\.020\.54±0\.008\\mathbf\{0\.54\\pm 0\.008\}Avg\. Rank \(↓\\downarrow\)5\.005\.006\.116\.117\.567\.565\.005\.003\.893\.892\.56¯\\underline\{2\.56\}3\.893\.891\.56

#### Evaluation Metrics\.

We consider the following evaluation criteria:

- •Binding Affinity \(BA\): Mean binding affinity \(kcal/mol\) of the top 10 strongest binding molecules to the particular protein target, predicted by Boltz\-2\.
- •Filtered Affinity \(FA\): We further filter our ligand sample by only considering ligands that satisfy sufficient Quantitative Estimate of Drug\-likeness\(Bickerton et al\.,[2012](https://arxiv.org/html/2605.12784#bib.bib1), QED\)and Synthetic Accessibility\(Ertl & Schuffenhauer,[2009](https://arxiv.org/html/2605.12784#bib.bib10), SA\)scores\. We filter byQED\>0\.5\\text\{QED\}\>0\.5andSA<3\.0\\text\{SA\}<3\.0, then take the mean binding affinity of the top 10 strongest binding molecules that survive this filter\. This is a crucial metric that measures the strength of generated ligands that may actually pass the first stage of a real\-world wet lab synthesis\.
- •Hypervolume \(HV\): Measures the Euclidean volume of the 3D\-space \(affinity, QED, SA\) covered by the non\-dominated Pareto frontier formed by the set of all generated molecules\. We scale all objectives to\[0,1\]\[0,1\]and use a reference point of\(1,1,1\)\(1,1,1\)for our calculations\.

#### Results\.

Table[1](https://arxiv.org/html/2605.12784#S4.T1)shows the results of running all baselines and ToolMol on all 3 protein targets, along with a Quantitative Estimate of Drug\-likeness\(Bickerton et al\.,[2012](https://arxiv.org/html/2605.12784#bib.bib1), QED\)maximization objective and a Synthetic Accessibility\(Ertl & Schuffenhauer,[2009](https://arxiv.org/html/2605.12784#bib.bib10), SA\)minimization objective\. We aim to generate molecules that yield the strongest possible binding affinity, that also simultaneously maintain strong\-enough QED and SA properties\. This is motivated by processes in real\-world drug discovery pipelines, where molecules are strongly optimized for binding affinity, but must also be sufficiently drug\-like and synthesizable to be realistic candidates\. We run all GA\-based methods \(Graph\-GA, MOLLEO, ToolMol\) on 5 different seeded sets of initial molecules, and ShinkaEvolve on 3 different seeded sets\. Both methods terminate after 1000 Boltz\-2 oracle evaluations\. We generate 3 seeds of 1000 sampled molecules from Pocket2Mol, TAGMol, and PAFlow for each target, matching the oracle budget for the GA & ShinkaEvolve\.

All metrics are reported on a sample of the entire generated ligand pool\. For each method, we first Butina cluster the full pool of all generated molecules \(with similarity threshold = 0\.6\), then from each resulting cluster, we take the molecule with the strongest binding affinity in that cluster\. This way, we most effectively assess the quality of all structurally unique generations, encouraging diversity in results and favoring consistent strong metrics across a wide region of chemical space\.

ToolMol achieves the best average rank across seven methods and nine metrics\. Notably, it outperforms every baseline in both multi\-objective metrics: filtered mean and hypervolume\. We consider these metrics to be the most important, as they are most pertinent to our multi\-objective problem statement\. ToolMol also consistently outperforms MOLLEO in single\-objective binding affinity\. Additionally, integrating the ToolMol toolbox into ShinkaEvolve yields the strongest binding affinity scores on two targets\. This demonstrates the generalizability of our tool\-calling framework, as our toolbox yields consistent improvements when integrated into two distinct optimization algorithms: classical genetic algorithms and MAP\-Elites\.

We note that while certain generative modeling baselines such as Pocket2Mol and TAGMol exceed our method in pure single\-objective affinity, the drastic drop in filtered binding affinity for those methods reveals that the crucial QED and SA properties are not sufficiently fulfilled\. This implies that the majority of the high scoring affinity compounds are not drug\-like or synthesizable enough to be practical\. This is further supported by the low hypervolume scores for these generative baselines\. Out of all tested methods, ToolMol is the most successful at creating molecular candidates that balance high binding affinity with desirable secondary objectives, reflecting high real\-world utility as a generative framework\.

Table 2:ABFE results of top 15 molecules for ToolMol and MF\-LAL\(Eckmann et al\.,[2025](https://arxiv.org/html/2605.12784#bib.bib9)\)\. ToolMol achieves significantly higher ABFE scores for both sets of evaluated molecules\.Methodc\-METBRD4ABFE \(↓\\downarrow\)QED \(↑\\uparrow\)SA \(↓\\downarrow\)ABFE \(↓\\downarrow\)QED \(↑\\uparrow\)SA \(↓\\downarrow\)MF\-LAL−6\.7±3\.1\-6\.7\\pm 3\.10\.63±0\.15¯\\underline\{0\.63\\pm 0\.15\}3\.50±0\.583\.50\\pm 0\.58−6\.2±3\.9\-6\.2\\pm 3\.90\.59±0\.07¯\\underline\{0\.59\\pm 0\.07\}3\.60±0\.553\.60\\pm 0\.55ToolMol−7\.96±2\.77\\mathbf\{\-7\.96\\pm 2\.77\}0\.45±0\.210\.45\\pm 0\.213\.26±0\.36¯\\underline\{3\.26\\pm 0\.36\}−8\.4±3\.9\\mathbf\{\-8\.4\\pm 3\.9\}0\.27±0\.180\.27\\pm 0\.183\.37±0\.45¯\\underline\{3\.37\\pm 0\.45\}ToolMol \(filtered\)−7\.3±3\.8¯\\underline\{\-7\.3\\pm 3\.8\}0\.66±0\.08\\mathbf\{0\.66\\pm 0\.08\}2\.75±0\.16\\mathbf\{2\.75\\pm 0\.16\}−6\.4±3\.5¯\\underline\{\-6\.4\\pm 3\.5\}0\.63±0\.10\\mathbf\{0\.63\\pm 0\.10\}2\.80±0\.14\\mathbf\{2\.80\\pm 0\.14\}

### 4\.2Absolute Binding Free Energy \(ABFE\)

To further demonstrate the usefulness of ToolMol for real\-world drug design, we compute Absolute Binding Free Energy \(ABFE\) scores for its generated molecules\. ABFE uses expensive molecular dynamics simulations to accurately calculate binding free energy\(Heinzelmann & Gilson,[2021](https://arxiv.org/html/2605.12784#bib.bib16)\)\. It is the current gold\-standard for computational binding affinity prediction, and thus reflects a higher degree of accuracy in predicting real\-world experimental activity\. We benchmark against MF\-LAL\(Eckmann et al\.,[2025](https://arxiv.org/html/2605.12784#bib.bib9)\), a state\-of\-the\-art multi\-fidelity approach to drug design that specifically targets ABFE scores through high\-fidelity guided VAE decoding\. Following the exact ABFE setup from MF\-LAL, we evaluate ToolMol on the two targets reported in the MF\-LAL paper, c\-MET and BRD4\. We use two sets of top 15 molecules from ToolMol, one ranked solely on binding energy and the other after applying the QED\>0\.5\>0\.5, SA<3\.0<3\.0filter\. These results are shown in Table[2](https://arxiv.org/html/2605.12784#S4.T2)\. Additional details about parameters used in these ABFE calculations can be found in Appendix[E](https://arxiv.org/html/2605.12784#A5)\.

The top 15 molecules ranked by Boltz\-2 predicted affinity achieve strong ABFE scores for both targets, surpassing MF\-LAL by a large margin\. This comes at a modest cost to secondary objectives, though these molecules still exceed MF\-LAL in synthesizability\. The top 15 filtered ligands yield slightly weaker ABFE scores, but still beat MF\-LAL in every metric, scoring higher on ABFE while maintaining more desirable QED and SA values\. Notably, although ABFE feedback is not explicitly included within our optimization pipeline, we outperform the current state\-of\-the\-art method for high ABFE\-scoring molecules simply by optimizing Boltz\-2 predicted affinity with a tool\-assisted LLM\. The fact that LLM\-generated ligands can achieve state\-of\-the\-art results in this area demonstrates great potential for LLMs to have a real, significant impact in computational drug discovery\.

### 4\.3Ablations

Table 3:Ablations: MOLLEO’s invalid generation rate and ToolMol’s genetic algorithm\. Isolating the impact of the ToolMol toolbox reveals that the tool\-calling process significantly improves results\.TargetMetricMOLLEOMOLLEO\(Retry Failures\)ToolMol\(MOLLEO GA\)ToolMolc\-METBinding Affinity \(↓\\downarrow\)−10\.15±0\.19\-10\.15\\pm 0\.19−9\.98±0\.31\-9\.98\\pm 0\.31−11\.14±0\.20\\mathbf\{\-11\.14\\pm 0\.20\}−11\.00±0\.09¯\\underline\{\-11\.00\\pm 0\.09\}Filtered Affinity \(↓\\downarrow\)−9\.62±0\.11\-9\.62\\pm 0\.11−9\.72±0\.24\-9\.72\\pm 0\.24−10\.22±0\.16¯\\underline\{\-10\.22\\pm 0\.16\}−10\.35±0\.17\\mathbf\{\-10\.35\\pm 0\.17\}Hypervolume \(↑\\uparrow\)0\.60±0\.01¯\\underline\{0\.60\\pm 0\.01\}0\.57±0\.010\.57\\pm 0\.010\.60±0\.02¯\\underline\{0\.60\\pm 0\.02\}0\.62±0\.01\\mathbf\{0\.62\\pm 0\.01\}BRD4Binding Affinity \(↓\\downarrow\)−9\.87±0\.23\-9\.87\\pm 0\.23−9\.67±0\.31\-9\.67\\pm 0\.31−10\.61±0\.33¯\\underline\{\-10\.61\\pm 0\.33\}−10\.64±0\.28\\mathbf\{\-10\.64\\pm 0\.28\}Filtered Affinity \(↓\\downarrow\)−9\.48±0\.19\-9\.48\\pm 0\.19−9\.43±0\.26\-9\.43\\pm 0\.26−9\.87±0\.32¯\\underline\{\-9\.87\\pm 0\.32\}−9\.91±0\.18\\mathbf\{\-9\.91\\pm 0\.18\}Hypervolume \(↑\\uparrow\)0\.59±0\.01¯\\underline\{0\.59\\pm 0\.01\}0\.56±0\.020\.56\\pm 0\.020\.59±0\.01¯\\underline\{0\.59\\pm 0\.01\}0\.60±0\.01\\mathbf\{0\.60\\pm 0\.01\}ACAA1Binding Affinity \(↓\\downarrow\)−8\.41±0\.41\-8\.41\\pm 0\.41−8\.04±0\.12\-8\.04\\pm 0\.12−9\.87±0\.18\\mathbf\{\-9\.87\\pm 0\.18\}−9\.70±0\.23¯\\underline\{\-9\.70\\pm 0\.23\}Filtered Affinity \(↓\\downarrow\)−8\.12±0\.45\-8\.12\\pm 0\.45−7\.93±0\.11\-7\.93\\pm 0\.11−8\.77±0\.24¯\\underline\{\-8\.77\\pm 0\.24\}−8\.78±0\.15\\mathbf\{\-8\.78\\pm 0\.15\}Hypervolume \(↑\\uparrow\)0\.51±0\.020\.51\\pm 0\.020\.49±0\.010\.49\\pm 0\.010\.53±0\.02¯\\underline\{0\.53\\pm 0\.02\}0\.54±0\.008\\mathbf\{0\.54\\pm 0\.008\}Avg\. Rank \(↓\\downarrow\)2\.892\.893\.873\.871\.78¯\\underline\{1\.78\}1\.22\\mathbf\{1\.22\}

We present ablations that specifically highlight the impact of the tool\-calling process, demonstrating the isolated impact of the toolbox provided to the LLM in ToolMol\. We compare with MOLLEO, which is the closest methodological neighbor to ToolMol\.

First, we ablate the effect of the underlying genetic algorithm on the results\. We run ToolMol on the exact MOLLEO genetic algorithm \(MOLLEO GA\); this isolates the particular impact of introducing function\-calling to the crossover/mutation process by removing algorithmic differences of the underlying framework\. Second, we ablate the consequences of the high invalid molecule generation rate faced by MOLLEO\. We observe that across 1000 molecule generations, MOLLEO will consistently yield∼\\sim350 invalid generations, due to formatting issues or syntactically invalid SMILES\. For each invalid generation, MOLLEO immediately falls back on the default Graph\-GA crossover / mutation crossovers\. We can naively reduce the MOLLEO failed generation rate simply by forcing the LLM to retry its generation until it yields a valid result\. This gives a stronger comparison between the MOLLEO generation process and the ToolMol function\-calling process per 1000 generations by eliminating a large portion of the extraneous Graph\-GA impact within MOLLEO\. We give the LLM a maximum of 10 retry steps\.

Table[3](https://arxiv.org/html/2605.12784#S4.T3)compares the original MOLLEO & ToolMol with the two aforementioned ablations\. We discuss two interesting results\. First, simply integrating ToolMol’s toolbox into MOLLEO’s genetic algorithm alone yields significant improvements in binding affinity and QED/SA over MOLLEO’s LLM\-based modifications\. Second, we report that the retry method for MOLLEO drops invalid LLM generations down to nearly0%0\\%, yielding a single digit number of invalid strings every 1000 generations\. This means that the majority of the final ligand pool is generated by the MOLLEO LLM operator\. However, we observe that this does not improve the performance on our evaluation targets, and in fact degrades performance across nearly every metric\. Thus, we further isolate the effect of the ToolMol function\-calling process by focusing solely on the LLM operator in MOLLEO, and demonstrate that ToolMol’s agent\-generated ligands are still superior to MOLLEO’s LLM\-generated ligands in all metrics\.

Figure 3:ToolMol & MOLLEO modification steps and reasoning traces\. MOLLEO fails to execute its planned modifications, while ToolMol successfully executes its ideas\.![Refer to caption](https://arxiv.org/html/2605.12784v1/figures/Figure3.png)
### 4\.4Why does tool\-calling improve performance?

To understand why tool\-calling benefits this black\-box optimization problem, we simulate a single LLM modification step by providing 2 fixed input molecules \(sampled from ZINC 250K\), and compare the reasoning traces of GPT\-OSS\-120B using the ToolMol toolbox to GPT\-OSS\-120B using the MOLLEO modification scheme\. Figure[3](https://arxiv.org/html/2605.12784#S4.F3)shows how the input molecules are modified by these two methods and examines the modifications against the corresponding reasoning traces\.

We observe that in the MOLLEO process, there are critical discrepancies between the planned modifications described in the LLM’s reasoning trace and the actual resulting molecule\. In contrast, every modification made to the molecule by ToolMol is exactly consistent with what the LLM describes in its reasoning trace\. Full reasoning traces for this case study can be found in Appendix[A](https://arxiv.org/html/2605.12784#A1)\.

For further quantitative confirmation, we repeat this experiment on 10 more pairs of distinct input ligands\. Out of 10 generations, ToolMol yields two processes where there is a some discrepancy between the reasoning trace and the resulting modification, while MOLLEO yields seven such erroneous processes, a significant difference \(p=0\.02p=0\.02, by 2\-sided independent t\-test\)\. The two ToolMol errors arise from slightly imprecise parameters passed into the tools \(e\.g\. callingreplace\_atomwith the same element that already exists at that index\) leading to a non\-matching change, while MOLLEO frequently misidentifies structures and inserts incorrect groups into the output\. Thus, in general, the resulting compound modifications generated with ToolMol better match the desired changes outlined by the LLM in its reasoning trace\. We conjecture that, by reducing the potential for error between LLM reasoning and the generated compounds, ToolMol takes better advantage of the vast chemical knowledge that LLMs naturally possess through pretraining\. We believe that this is largely why ToolMol achieves significantly stronger binding affinity results on every protein target that we tested on\.

## 5Discussion & Conclusion

We present ToolMol, an agentic multi\-objective drug discovery framework that iteratively optimizes small molecule ligands for protein binding\. We build a Pareto\-optimizing genetic algorithm that utilizes an exponential\-sampling procedure, and combine it with an LLM that has access to a structured toolbox of seven deterministic, RDKit\-backed operations\. Rather than requiring an LLM to directly generate or modify molecular string encodings \(a task prone to syntactic failure\), this agentic framework reduces the potential failure surface by abstracting away the necessity for the LLM to be syntactically perfect in its outputs\. We achieve state\-of\-the\-art results in three protein\-ligand binding tasks, consistently generating molecules that outscore baselines in predicted binding affinity, QED, and SA\. Despite not being directly optimized for ABFE score anywhere in its pipeline, ToolMol also achieves exceptional results in this area, markedly increasing the power of LLMs as tools for computational drug discovery\. We hypothesize that the inherent chemical knowledge that LLMs hold benefits them in designing more realistic molecules, perhaps more similar to what an actual medicinal chemist might synthesize\. This gives them a distinct advantage over recent generative models, which often generate compounds that lack desired molecular properties, demonstrated by the poor multi\-objective metric scores achieved by diffusion and flow\-based methods on our task\.

#### Limitations

It is important to note the existence of concerns about the accuracy of Boltz\-2 as an affinity predictor\. In particular, recent studies have shown that the performance of Boltz\-2 degrades significantly when evaluating on novel, out\-of\-distribution ligand scaffolds and protein targets\(Li et al\.,[2026](https://arxiv.org/html/2605.12784#bib.bib24); Shepard et al\.,[2026](https://arxiv.org/html/2605.12784#bib.bib38)\)\. Nonetheless, there is evidence that Boltz\-2 outperforms the primary industry alternative, AutoDock Vina, in pose and affinity prediction\(Liu et al\.,[2026](https://arxiv.org/html/2605.12784#bib.bib27)\)\. We also have observed that Boltz\-2 shows better agreement with well\-regarded Absolute Binding Free Energy calculations than AutoDock, evaluated on one of our own relevant protein targets \(see Appendix[E\.2](https://arxiv.org/html/2605.12784#A5.SS2)\)\.

## Impact Statement

This work has the potential to accelerate the early stages of drug discovery by enabling more efficient identification of high\-affinity, drug\-like ligand candidates\. Impacts include reducing the time and cost of lead optimization, improving the quality of computationally generated drug candidates entering wet\-lab validation, and providing a modular, interpretable framework where an LLM’s reasoning for each molecular modification is transparent and traceable through tool calls\.

## Author Disclosures

M\.K\.G\. has an equity interest in and is a cofounder and scientific advisor of VeraChem LLC\. He is also on the scientific advisory boards of Denovicon Therapeutics, In Cerebro, Cold Start Therapeutics, and Beren Therapeutics\. R\.Y and P\.E have equity interests and are co\-founders of Aethermol LLC\.

## Acknowledgment

This work was supported in part by the U\.S\. Army Research Office under Army\-ECASE award W911NF\-07\-R\-0003\-03, the U\.S\. Department Of Energy, Office of Science, IARPA HAYSTAC Program, and NSF Grants \#2205093, \#2146343, \#2134274, \#2441832, CDC\-RFA\-FT\-23\-0069, DARPA AIE FoundSci and DARPA YFA\.

## References

- Bickerton et al\. \(2012\)Bickerton, G\. R\., Paolini, G\. V\., Besnard, J\., Muresan, S\., and Hopkins, A\. L\.Quantifying the chemical beauty of drugs\.*Nature Chemistry*, 4\(2\):90–98, January 2012\.ISSN 1755\-4349\.doi:10\.1038/nchem\.1243\.URL[http://dx\.doi\.org/10\.1038/nchem\.1243](http://dx.doi.org/10.1038/nchem.1243)\.
- Boiko et al\. \(2023\)Boiko, D\. A\., MacKnight, R\., Kline, B\., and Gomes, G\.Autonomous chemical research with large language models\.*Nature*, 624\(7992\):570–578, 2023\.
- Bran et al\. \(2023\)Bran, A\. M\., Cox, S\., Schilter, O\., Baldassari, C\., White, A\. D\., and Schwaller, P\.Chemcrow: Augmenting large\-language models with chemistry tools\.*arXiv preprint arXiv:2304\.05376*, 2023\.
- Choi et al\. \(2026\)Choi, C\., Zou, Y\., Müller, M\., Hao, H\., Kang, Y\., Pérez\-Sánchez, J\. B\., Gustin, I\., Xu, H\., Wang, A\., Vakili, M\. G\., Crebolder, C\., Aspuru\-Guzik, A\., and Bernales, V\.El agente estructural: An artificially intelligent molecular editor, 2026\.URL[https://arxiv\.org/abs/2602\.04849](https://arxiv.org/abs/2602.04849)\.
- Crucitti et al\. \(2024\)Crucitti, D\., Pérez Míguez, C\., Díaz Arias, J\. A\., Fernandez Prada, D\. B\., and Mosquera Orgueira, A\.De novo drug design through artificial intelligence: an introduction\.*Frontiers in Hematology*, Volume 3 \- 2024, 2024\.ISSN 2813\-3935\.doi:10\.3389/frhem\.2024\.1305741\.URL[https://www\.frontiersin\.org/journals/hematology/articles/10\.3389/frhem\.2024\.1305741](https://www.frontiersin.org/journals/hematology/articles/10.3389/frhem.2024.1305741)\.
- Dorna et al\. \(2024\)Dorna, V\., Subhalingam, D\., Kolluru, K\., Tuli, S\., Singh, M\., Singal, S\., Krishnan, N\. M\. A\., and Ranu, S\.Tagmol: Target\-aware gradient\-guided molecule generation, 2024\.URL[https://arxiv\.org/abs/2406\.01650](https://arxiv.org/abs/2406.01650)\.
- Dunn et al\. \(2025\)Dunn, I\., Toft, L\., Katz, T\., Gupta, J\., Shah, R\., Hettiarachchi, R\., and Koes, D\. R\.Omtra: A multi\-task generative model for structure\-based drug design, 2025\.URL[https://arxiv\.org/abs/2512\.05080](https://arxiv.org/abs/2512.05080)\.
- Eckmann et al\. \(2022\)Eckmann, P\., Sun, K\., Zhao, B\., Feng, M\., Gilson, M\. K\., and Yu, R\.Limo: Latent inceptionism for targeted molecule generation, 2022\.URL[https://arxiv\.org/abs/2206\.09010](https://arxiv.org/abs/2206.09010)\.
- Eckmann et al\. \(2025\)Eckmann, P\., Wu, D\., Heinzelmann, G\., Gilson, M\. K\., and Yu, R\.Mf\-lal: Drug compound generation using multi\-fidelity latent space active learning, 2025\.URL[https://arxiv\.org/abs/2410\.11226](https://arxiv.org/abs/2410.11226)\.
- Ertl & Schuffenhauer \(2009\)Ertl, P\. and Schuffenhauer, A\.Estimation of synthetic accessibility score of drug\-like molecules based on molecular complexity and fragment contributions\.*Journal of Cheminformatics*, 1\(1\), June 2009\.ISSN 1758\-2946\.doi:10\.1186/1758\-2946\-1\-8\.URL[http://dx\.doi\.org/10\.1186/1758\-2946\-1\-8](http://dx.doi.org/10.1186/1758-2946-1-8)\.
- Feng et al\. \(2022\)Feng, M\., Heinzelmann, G\., and Gilson, M\. K\.Absolute binding free energy calculations improve enrichment of actives in virtual compound screening\.*Scientific Reports*, 12\(1\), August 2022\.ISSN 2045\-2322\.doi:10\.1038/s41598\-022\-17480\-w\.URL[http://dx\.doi\.org/10\.1038/s41598\-022\-17480\-w](http://dx.doi.org/10.1038/s41598-022-17480-w)\.
- Flam\-Shepherd & Aspuru\-Guzik \(2023\)Flam\-Shepherd, D\. and Aspuru\-Guzik, A\.Language models can generate molecules, materials, and protein binding sites directly in three dimensions as xyz, cif, and pdb files, 2023\.URL[https://arxiv\.org/abs/2305\.05708](https://arxiv.org/abs/2305.05708)\.
- Gómez\-Bombarelli et al\. \(2018\)Gómez\-Bombarelli, R\., Wei, J\. N\., Duvenaud, D\., Hernández\-Lobato, J\. M\., Sánchez\-Lengeling, B\., Sheberla, D\., Aguilera\-Iparraguirre, J\., Hirzel, T\. D\., Adams, R\. P\., and Aspuru\-Guzik, A\.Automatic chemical design using a data\-driven continuous representation of molecules\.*ACS central science*, 4\(2\):268–276, 2018\.
- Guan et al\. \(2024\)Guan, J\., Zhou, X\., Yang, Y\., Bao, Y\., Peng, J\., Ma, J\., Liu, Q\., Wang, L\., and Gu, Q\.Decompdiff: Diffusion models with decomposed priors for structure\-based drug design, 2024\.URL[https://arxiv\.org/abs/2403\.07902](https://arxiv.org/abs/2403.07902)\.
- Guo et al\. \(2023\)Guo, T\., Guo, K\., Nan, B\., Liang, Z\., Guo, Z\., Chawla, N\. V\., Wiest, O\., and Zhang, X\.What can large language models do in chemistry? a comprehensive benchmark on eight tasks, 2023\.URL[https://arxiv\.org/abs/2305\.18365](https://arxiv.org/abs/2305.18365)\.
- Heinzelmann & Gilson \(2021\)Heinzelmann, G\. and Gilson, M\. K\.Automation of absolute protein\-ligand binding free energy calculations for docking refinement and compound evaluation\.*Scientific Reports*, 11\(1\), January 2021\.ISSN 2045\-2322\.doi:10\.1038/s41598\-020\-80769\-1\.URL[http://dx\.doi\.org/10\.1038/s41598\-020\-80769\-1](http://dx.doi.org/10.1038/s41598-020-80769-1)\.
- Hong et al\. \(2016\)Hong, S\. H\., Eun, J\. W\., Choi, S\. K\., Shen, Q\., Choi, W\. S\., Han, J\.\-W\., Nam, S\. W\., and You, J\. S\.Epigenetic reader brd4 inhibition as a therapeutic strategy to suppress e2f2\-cell cycle regulation circuit in liver cancer\.*Oncotarget*, 7\(22\):32628–32640, April 2016\.ISSN 1949\-2553\.doi:10\.18632/oncotarget\.8701\.URL[http://dx\.doi\.org/10\.18632/oncotarget\.8701](http://dx.doi.org/10.18632/oncotarget.8701)\.
- Hoogeboom et al\. \(2022\)Hoogeboom, E\., Satorras, V\. G\., Vignac, C\., and Welling, M\.Equivariant diffusion for molecule generation in 3d, 2022\.URL[https://arxiv\.org/abs/2203\.17003](https://arxiv.org/abs/2203.17003)\.
- Jensen \(2019\)Jensen, J\. H\.A graph\-based genetic algorithm and generative model/monte carlo tree search for the exploration of chemical space\.*Chemical Science*, 10\(12\):3567–3572, 2019\.ISSN 2041\-6539\.doi:10\.1039/c8sc05372c\.URL[http://dx\.doi\.org/10\.1039/C8SC05372C](http://dx.doi.org/10.1039/C8SC05372C)\.
- Jin et al\. \(2018\)Jin, W\., Barzilay, R\., and Jaakkola, T\.Junction tree variational autoencoder for molecular graph generation\.In*International conference on machine learning*, pp\. 2323–2332\. PMLR, 2018\.
- Joshi et al\. \(2025\)Joshi, C\. K\., Fu, X\., Liao, Y\.\-L\., Gharakhanyan, V\., Miller, B\. K\., Sriram, A\., and Ulissi, Z\. W\.All\-atom diffusion transformers: Unified generative modelling of molecules and materials, 2025\.URL[https://arxiv\.org/abs/2503\.03965](https://arxiv.org/abs/2503.03965)\.
- Lange et al\. \(2025\)Lange, R\. T\., Imajuku, Y\., and Cetin, E\.Shinkaevolve: Towards open\-ended and sample\-efficient program evolution, 2025\.URL[https://arxiv\.org/abs/2509\.19349](https://arxiv.org/abs/2509.19349)\.
- Lee et al\. \(2023\)Lee, S\., Jo, J\., and Hwang, S\. J\.Exploring chemical space with score\-based out\-of\-distribution generation, 2023\.URL[https://arxiv\.org/abs/2206\.07632](https://arxiv.org/abs/2206.07632)\.
- Li et al\. \(2026\)Li, Y\., Zhan, R\.\-H\., Rao, J\., Liu, M\., Sang, P\., Zeng, X\., Zheng, M\., Li, X\., and Yang, L\.Structure\-informed machine learning for drug discovery: a task\-centric perspective\.*Brief\. Bioinform\.*, 27\(1\), January 2026\.
- Liu et al\. \(2025\)Liu, C\., Vadgama, S\., Ruhe, D\., Bekkers, E\., and Forré, P\.Clifford group equivariant diffusion models for 3d molecular generation, 2025\.URL[https://arxiv\.org/abs/2504\.15773](https://arxiv.org/abs/2504.15773)\.
- Liu et al\. \(2007\)Liu, T\., Lin, Y\., Wen, X\., Jorissen, R\. N\., and Gilson, M\. K\.Bindingdb: a web\-accessible database of experimentally determined protein\-ligand binding affinities\.*Nucleic Acids Research*, 35\(Database\):D198–D201, January 2007\.ISSN 1362\-4962\.doi:10\.1093/nar/gkl999\.URL[http://dx\.doi\.org/10\.1093/nar/gkl999](http://dx.doi.org/10.1093/nar/gkl999)\.
- Liu et al\. \(2026\)Liu, Y\., Tang, H\., Niu, T\., and Wang, J\.A comparative study of deep learning and classical modeling approaches for protein–ligand binding pose and affinity prediction in coronavirus main proteases\.*Journal of Chemical Information and Modeling*, 66\(1\):731–743, 2026\.doi:10\.1021/acs\.jcim\.5c02481\.URL[https://doi\.org/10\.1021/acs\.jcim\.5c02481](https://doi.org/10.1021/acs.jcim.5c02481)\.PMID: 41429653\.
- Ma et al\. \(2024\)Ma, T\., Lin, X\., Li, T\., Li, C\., Chen, L\., Zhou, P\., Cai, X\., Yang, X\., Zeng, D\., Cao, D\., and Zeng, X\.Y\-mol: A multiscale biomedical knowledge\-guided large language model for drug development, 2024\.URL[https://arxiv\.org/abs/2410\.11550](https://arxiv.org/abs/2410.11550)\.
- Mouret & Clune \(2015\)Mouret, J\.\-B\. and Clune, J\.Illuminating search spaces by mapping elites, 2015\.URL[https://arxiv\.org/abs/1504\.04909](https://arxiv.org/abs/1504.04909)\.
- Noh et al\. \(2022\)Noh, J\., Jeong, D\.\-W\., Kim, K\., Han, S\., Lee, M\., Lee, H\., and Jung, Y\.Path\-aware and structure\-preserving generation of synthetically accessible molecules\.In Chaudhuri, K\., Jegelka, S\., Song, L\., Szepesvari, C\., Niu, G\., and Sabato, S\. \(eds\.\),*Proceedings of the 39th International Conference on Machine Learning*, volume 162 of*Proceedings of Machine Learning Research*, pp\. 16952–16968\. PMLR, 17–23 Jul 2022\.URL[https://proceedings\.mlr\.press/v162/noh22a\.html](https://proceedings.mlr.press/v162/noh22a.html)\.
- Oestreich et al\. \(2025\)Oestreich, M\., Merdivan, E\., Lee, M\., Schultze, J\. L\., Piraud, M\., and Becker, M\.DrugDiff: small molecule diffusion model with flexible guidance towards molecular properties\.*J\. Cheminform\.*, 17\(1\):23, February 2025\.
- OpenAI et al\. \(2025\)OpenAI, :, Agarwal, S\., Ahmad, L\., Ai, J\., Altman, S\., Applebaum, A\., Arbus, E\., Arora, R\. K\., Bai, Y\., Baker, B\., Bao, H\., Barak, B\., Bennett, A\., Bertao, T\., Brett, N\., Brevdo, E\., Brockman, G\., Bubeck, S\., Chang, C\., Chen, K\., Chen, M\., Cheung, E\., Clark, A\., Cook, D\., Dukhan, M\., Dvorak, C\., Fives, K\., Fomenko, V\., Garipov, T\., Georgiev, K\., Glaese, M\., Gogineni, T\., Goucher, A\., Gross, L\., Guzman, K\. G\., Hallman, J\., Hehir, J\., Heidecke, J\., Helyar, A\., Hu, H\., Huet, R\., Huh, J\., Jain, S\., Johnson, Z\., Koch, C\., Kofman, I\., Kundel, D\., Kwon, J\., Kyrylov, V\., Le, E\. Y\., Leclerc, G\., Lennon, J\. P\., Lessans, S\., Lezcano\-Casado, M\., Li, Y\., Li, Z\., Lin, J\., Liss, J\., Lily, Liu, Liu, J\., Lu, K\., Lu, C\., Martinovic, Z\., McCallum, L\., McGrath, J\., McKinney, S\., McLaughlin, A\., Mei, S\., Mostovoy, S\., Mu, T\., Myles, G\., Neitz, A\., Nichol, A\., Pachocki, J\., Paino, A\., Palmie, D\., Pantuliano, A\., Parascandolo, G\., Park, J\., Pathak, L\., Paz, C\., Peran, L\., Pimenov, D\., Pokrass, M\., Proehl, E\., Qiu, H\., Raila, G\., Raso, F\., Ren, H\., Richardson, K\., Robinson, D\., Rotsted, B\., Salman, H\., Sanjeev, S\., Schwarzer, M\., Sculley, D\., Sikchi, H\., Simon, K\., Singhal, K\., Song, Y\., Stuckey, D\., Sun, Z\., Tillet, P\., Toizer, S\., Tsimpourlas, F\., Vyas, N\., Wallace, E\., Wang, X\., Wang, M\., Watkins, O\., Weil, K\., Wendling, A\., Whinnery, K\., Whitney, C\., Wong, H\., Yang, L\., Yang, Y\., Yasunaga, M\., Ying, K\., Zaremba, W\., Zhan, W\., Zhang, C\., Zhang, B\., Zhang, E\., and Zhao, S\.gpt\-oss\-120b & gpt\-oss\-20b model card, 2025\.URL[https://arxiv\.org/abs/2508\.10925](https://arxiv.org/abs/2508.10925)\.
- Organ & Tsao \(2011\)Organ, S\. L\. and Tsao, M\.\-S\.An overview of the c\-met signaling pathway\.*Therapeutic Advances in Medical Oncology*, 3\(1 suppl\):S7–S19, November 2011\.ISSN 1758\-8359\.doi:10\.1177/1758834011422556\.URL[http://dx\.doi\.org/10\.1177/1758834011422556](http://dx.doi.org/10.1177/1758834011422556)\.
- Passaro et al\. \(2025\)Passaro, S\., Corso, G\., Wohlwend, J\., Reveiz, M\., Thaler, S\., Somnath, V\. R\., Getz, N\., Portnoi, T\., Roy, J\., Stark, H\., Kwabi\-Addo, D\., Beaini, D\., Jaakkola, T\., and Barzilay, R\.Boltz\-2: Towards accurate and efficient binding affinity prediction\.June 2025\.doi:10\.1101/2025\.06\.14\.659707\.URL[http://dx\.doi\.org/10\.1101/2025\.06\.14\.659707](http://dx.doi.org/10.1101/2025.06.14.659707)\.
- Peng et al\. \(2025\)Peng, X\., Luo, S\., Guan, J\., Xie, Q\., Peng, J\., and Ma, J\.Pocket2mol: Efficient molecular sampling based on 3d protein pockets, 2025\.URL[https://arxiv\.org/abs/2205\.07249](https://arxiv.org/abs/2205.07249)\.
- Pettersen et al\. \(2020\)Pettersen, E\. F\., Goddard, T\. D\., Huang, C\. C\., Meng, E\. C\., Couch, G\. S\., Croll, T\. I\., Morris, J\. H\., and Ferrin, T\. E\.¡scp¿ucsf chimerax¡/scp¿: Structure visualization for researchers, educators, and developers\.*Protein Science*, 30\(1\):70–82, October 2020\.ISSN 1469\-896X\.doi:10\.1002/pro\.3943\.URL[http://dx\.doi\.org/10\.1002/pro\.3943](http://dx.doi.org/10.1002/pro.3943)\.
- Sheikholeslami et al\. \(2025\)Sheikholeslami, M\., Mazrouei, N\., Gheisari, Y\., Fasihi, A\., Irajpour, M\., and Motahharynia, A\.Druggen enhances drug discovery with large language models and reinforcement learning\.*Scientific Reports*, 15\(1\), 2025\.ISSN 2045\-2322\.doi:10\.1038/s41598\-025\-98629\-1\.URL[http://dx\.doi\.org/10\.1038/s41598\-025\-98629\-1](http://dx.doi.org/10.1038/s41598-025-98629-1)\.
- Shepard et al\. \(2026\)Shepard, V\., Musin, A\., Chebykina, K\., Zeninskaya, N\. A\., Mistryukova, L\., Avchaciov, K\., and Fedichev, P\. O\.Harvest: Unlocking the dark bioactivity data of pharmaceutical patents via agentic ai\.March 2026\.doi:10\.64898/2026\.03\.15\.711910\.URL[http://dx\.doi\.org/10\.64898/2026\.03\.15\.711910](http://dx.doi.org/10.64898/2026.03.15.711910)\.
- Sterling & Irwin \(2015\)Sterling, T\. and Irwin, J\. J\.Zinc 15 – ligand discovery for everyone\.*Journal of Chemical Information and Modeling*, 55\(11\):2324–2337, November 2015\.ISSN 1549\-960X\.doi:10\.1021/acs\.jcim\.5b00559\.URL[http://dx\.doi\.org/10\.1021/acs\.jcim\.5b00559](http://dx.doi.org/10.1021/acs.jcim.5b00559)\.
- Trott & Olson \(2009\)Trott, O\. and Olson, A\. J\.Autodock vina: Improving the speed and accuracy of docking with a new scoring function, efficient optimization, and multithreading\.*Journal of Computational Chemistry*, 31\(2\):455–461, June 2009\.ISSN 1096\-987X\.doi:10\.1002/jcc\.21334\.URL[http://dx\.doi\.org/10\.1002/jcc\.21334](http://dx.doi.org/10.1002/jcc.21334)\.
- Vadgama et al\. \(2026\)Vadgama, S\., Islam, M\. M\., Buracas, D\., Shewmake, C\. A\., Moskalev, A\., and Bekkers, E\. J\.Probing equivariance and symmetry breaking in convolutional networks\.In*The Thirty\-ninth Annual Conference on Neural Information Processing Systems*, 2026\.URL[https://openreview\.net/forum?id=ghyYc7hgSU](https://openreview.net/forum?id=ghyYc7hgSU)\.
- Wang et al\. \(2025\)Wang, H\., Skreta, M\., Ser, C\.\-T\., Gao, W\., Kong, L\., Strieth\-Kalthoff, F\., Duan, C\., Zhuang, Y\., Yu, Y\., Zhu, Y\., Du, Y\., Aspuru\-Guzik, A\., Neklyudov, K\., and Zhang, C\.Efficient evolutionary search over chemical space with large language models, 2025\.URL[https://arxiv\.org/abs/2406\.16976](https://arxiv.org/abs/2406.16976)\.
- White \(2023\)White, A\. D\.The future of chemistry is language\.*Nature Reviews Chemistry*, 7\(7\):457–458, May 2023\.ISSN 2397\-3358\.doi:10\.1038/s41570\-023\-00502\-0\.URL[http://dx\.doi\.org/10\.1038/s41570\-023\-00502\-0](http://dx.doi.org/10.1038/s41570-023-00502-0)\.
- Zhou et al\. \(2025\)Zhou, J\., Qian, H\., Tu, S\., and Xu, L\.Prior\-guided flow matching for target\-aware molecule design with learnable atom number, 2025\.URL[https://arxiv\.org/abs/2509\.01486](https://arxiv.org/abs/2509.01486)\.
- Zhou et al\. \(2024\)Zhou, X\., Cheng, X\., Yang, Y\., Bao, Y\., Wang, L\., and Gu, Q\.Decompopt: Controllable and decomposed diffusion models for structure\-based molecular optimization, 2024\.URL[https://arxiv\.org/abs/2403\.13829](https://arxiv.org/abs/2403.13829)\.
- Zhu et al\. \(2023\)Zhu, Y\., Wu, J\., Hu, C\., Yan, J\., Hsieh, C\.\-Y\., Hou, T\., and Wu, J\.Sample\-efficient multi\-objective molecular optimization with gflownets, 2023\.URL[https://arxiv\.org/abs/2302\.04040](https://arxiv.org/abs/2302.04040)\.

## Appendix ACase Study

We provide further details on the case study presented in Figure[3](https://arxiv.org/html/2605.12784#S4.F3)of the main paper\. Below are the initial two molecules given to both MOLLEO and ToolMol\.

![[Uncaptioned image]](https://arxiv.org/html/2605.12784v1/figures/lig1.png)

![[Uncaptioned image]](https://arxiv.org/html/2605.12784v1/figures/lig2.png)

First, we see how the LLM makes modificationswithoutthe ToolMol toolbox\. Here are excerpts from its reasoning trace that describe how it intends to modify the molecule:

MOLLEO Reasoning Trace”…We can propose a hybrid: combine quinazoline core with a morpholine or piperazine side chain \(like in many kinase inhibitors\)\. Add halogen for potency, add \-OCH3 groups for interactions\. Use a hinge\-binding motif: a N,N\-dimethylcarboxamide? Actually c\-MET hinge binder often uses 4\-anilino quinazoline\. We could take quinazoline from first \(c1nc2ccc\(Cl\)cn2c1C=O\), but modify the N\-linked hydrazone to an anilino group with fluorine\. Add a piperidine amide from second\. Proposed SMILES: something like: ”CCc1nc2ccc\(Cl\)cn2c1C\(=O\)NCC\(=O\)N1CCCC1c1ccc\(F\)cc1O”\. This combines the quinazolinone core with a piperidine \(N1CCCC1\) attached via amide to an aryl with fluorine…Letś design stepwise: Core: CCc1nc2ccc\(Cl\)cn2c1C\(=O\)N — attach a linker: N\-CH2\-CH2\-C\(=O\)\-N1CCCC1 \(piperidine\)\. Then attach a phenyl ring to piperidine nitrogen…Thus SMILES: ”CCc1nc2ccc\(Cl\)cn2c1C\(=O\)NCC\(=O\)N1CCCC1c1c\(F\)cc\(OC\)cc1OC”\.”

Below is a visualization of the molecule generated from this process:

Figure 4:This figure shows the initial molecules, and resultant molecules after LLM modifications using MOLLEO and ToolMol\. We see that MOLLEO fails to generate the required molecule![Refer to caption](https://arxiv.org/html/2605.12784v1/figures/molleo_lig.png)We observe that while many parts of the final molecule are consistent with what is described by the reasoning trace, there are certain parts that are entirely inconsistent with the LLM’s planned modifications\. For instance, it insists that it is using the piperidine ring from ligand 2, but the ring at indices 18\-22 is actually a 5\-membered pyrrolidine ring; it hallucinates the SMILES for piperidine multiple times throughout its reasoning trace\. The resulting molecule is actually a reasonable crossover between the inputs, but is simply not fully consistent with the reasoning trace that generated it\. We observe this trend commonly within MOLLEO generations, where the resulting modification\(s\) do not exactly align with the desired theoretical modifications that the LLM envisioned\.

We contrast this to the ToolMol function\-calling process\. Here are the reasoning traces for multiple tool steps\.

ToolMol Reasoning TraceNeed to use tools to make modifications\. First decide cut points\. Use crossover\_molecules tool\. Choose roughly half indices with high centrality not in ring? Must not split at ring index…For ligand1, perhaps cut at atom 12 \(C, not ring\) centrality high \(0\.54\)\. For ligand2, cut at atom 7 maybe \(C, not ring\) centrality 0\.48\. Use those\. Weĺl call crossover\.…Typically, we might want to increase H\-bond donors/acceptors, maybe add a fluorine, or replace a chlorine with fluorine to improve metabolic stability\. Also maintain drug\-like properties…Look at structure: Thereś a fused heterocycle \(maybe quinazoline\)\. The chlorine is on aromatic ring\. We could replace with fluorine\. Use replace\_atom tool\.…Goal: better c\-MET binding maybe increase H\-bond donors/acceptors, aromaticity, add fluorine, etc\. Looking at structure: It has fused heteroaryl \(pyrimidine\) etc\. Maybe add a methyl group on the phenyl ring to increase hydrophobic interaction\. Which atom index corresponds to aromatic carbon with substitution? Letś inspect aromatic ring indices…So we can add a methyl to a carbon with a hydrogen, e\.g\., atom 5 has substitutable\_hydrogens 1 \(per data\)\. Itś a carbon in aromatic ring with one H\. So we can add methyl group via add\_functional\_group\.

This describes a 3\-step function\-calling process that the LLM carried out, first executing a crossover, then making two minor modifications to the resulting molecules\. Below is the final generated molecule:

![[Uncaptioned image]](https://arxiv.org/html/2605.12784v1/figures/toolmol_final.png)In contrast to the MOLLEO generation process, we observe that every modification made to the molecule is perfectly consistent with what the LLM describes in its reasoning trace\. We can see that the ligands were cut at the correct desired indices \(12 on ligand 1 and 7 on ligand 2\) and merged together to form the desired crossover\. Then the model describes replacing the chlorine with fluorine to ”improve metabolic stability”, which is correctly carried out at index 9\. Finally, it wishes to add a methyl group to ”increase hydrophobic interaction”, which is also correctly done at index 5/6\.

## Appendix BFull ToolMol Setup

### B\.1Toolbox

In this section, we provide more detailed descriptions of each tool in the ToolMol toolbox, and specifications for usage of the parameters\.

1. 1\.add\_atom\(mol, idx, element, bond\) Adds a single atom of typeelementto the current molecule\. The LLM provides the currentmolas a SMILES string\. The LLM must also provide the index of the existing atom that will receive a bond to the newly added atom, as well as the type of the bond \(i\.e\. single, double, or triple\)\.
2. 2\.replace\_atom\(mol, idx, element\) Replaces the atom at the specified index in the current molecule by a single atom of typeelement, attempting to preserve all existing bonds\.
3. 3\.add\_functional\_group\(mol, idx, group, bond\) Adds a single predefined functional group to the current molecule at the specified index\. We build a table mapping common functional groups and rings to their SMILES encodings \(e\.g\. methyl, propyl, phenyl, etc\), and expose the table to the LLM\. It specifies thegroupparameter in English, and allows the SMILES specification to be handled by the table\. Each table entry has a predefined attachment point, and the LLM additionally specifies the bond type for this attachment\.
4. 4\.add\_substructure\(mol, idx, substructure, bond\) Adds a manually\-specified SMILES string substructure to the current molecule at the specified index\. Unlike the previous function, the LLM has full flexibility in adding a custom substructure\. It must specify an attachment point insubstructurein standard\[\*1\]notation, as well as the bond type\.
5. 5\.replace\_substructure\(mol, idx, old\_substructure, new\_substructure\) Replaces an existing substructure withinmolwith a new substructure\. The LLM specifiesold\_substructureusing SMARTS \(SMILES Arbitrary Target Specification\) notation, and provides a new custom substructure as a SMILES\. The index parameter here is used as an “anchor” to remove ambiguity if the SMARTS matches multiple substructures; only the substructure that contains theidxatom will be selected\. Due to the complex nature of removing an entire substructure and reattaching a new one, this function is generally only used on terminal substructures, i\.e\. substructures that result in only one broken bond when completely removed\.
6. 6\.remove\_substructure\(mol, idx, substructure\) Removes an entire existing substructure withinmol, specified by SMARTS notation\. Similar toadd\_substructure, the LLM specifies an anchoridxto remove ambiguity in the case of multiple matches\. Due to the possibility of creating fragments when removing central substructures, this function is also primarily used on terminal substructures\.
7. 7\.crossover\_molecules\(mol1, idx1, mol2, idx2\) Takes in two separate molecules as input and performs a crossover operation on them\. The LLM specifies 2 indices, one for each molecule\. The function then attempts to split both molecules at their respective indices\. If successful, this results in 4 molecular fragments, from which one of the 4 possible crossover combinations is randomly chosen and returned\. This function fails if either index does not result in 2 distinct fragments after the split operation \(e\.g\. if the index is part of a ring structure\)\.

### B\.2ToolMol Prompts

Next, we share detailed information about the prompts and information given to the LLM in ToolMol\. First, we introduce two functions that provide important structure and property\-based information about an input molecule\. These functions are non\-callable by the LLM, and are instead deterministically provided before every LLM modification step\.

1. 1\.get\_ligand\_structure\(mol\) This function returns the following information for every single atom in the input molecule:atom\_index,element,num\_substitutable\_hydrogens,num\_available\_valences,num\_neighboring\_atoms,neighbor\_indices,is\_in\_ring, andcentrality\. We observe that the LLM is able to parse this dense atom\-wise information quite well, and find this is sufficient to provide the LLM with a solid structural understanding of the ligand\.centralityis a measure of how central the atom is with respect to the molecular graph, and is calculated with the betweenness centrality formula\. This is an important measure for the LLM to choose a crossover location, as we ideally want to split each molecule as evenly as possible to create the most reasonably\-sized fragments\. Choosing an atom with high centrality is likely to result in an evenly\-split molecule\.
2. 2\.calculate\_properties\(mol\) This function returns basic molecular descriptors for the input directly provided by RDKit\. It returns:QED,SA,molecular\_weight,LogP,TPSA,num\_HBond\_donors,num\_HBond\_acceptors,num\_rotatable\_bonds, andnum\_aromatic\_rings\.This information primarily helps guide the model’s decisions at intermediate steps\.

First, below is the system prompt given to the LLM for every modification\.

ToolMol System PromptYou are a molecular design agent\.You may ONLY modify molecules using tools\.Only make one modification at a time\.Read the parameter descriptions for the tools very carefully\.Always ensure that your modifications don’t break valence rules and do not result in a fragmented molecule\.

Next is the full initial ToolMol prompt, given to the LLM at the beginning of a modification step\.

Initial ToolMol Prompt”Goal: I want to improve Binding Affinity against \[PROTEIN\_TARGET\], minimize SA \(Synthetic Accessibility\), and maximize QED\. Recall that a more negative binding affinity is better, and a more positive binding affinity is worse\. Please propose a new molecule better than the current molecule\. I have given you two candidate ligands\. Please propose a new molecule that binds better to \[PROTEIN\_TARGET\]\. You are encouraged to make a crossover between the candidate molecules on the first step, then mutate the resulting molecule\. Only make a few modifications \(at most 3\), then respond with FINAL\_ANSWER\. Do not let molecular weight exceed 700\.1\. \[LIGAND 1\]Binding Affinity against \[PROTEIN\_TARGET\]: xSA \(Synthetic Accessibility\): xQED: x2\. \[LIGAND 2\]Binding Affinity against \[PROTEIN\_TARGET\]: xSA \(Synthetic Accessibility\): xQED: xLigand structure and possible attachment points for ligand 1: \[get\_ligand\_structure\(mol1\)\]Ligand structure and possible attachment points for ligand 2: \[get\_ligand\_structure\(mol2\)\]Molecule properties for ligand 1: \[calculate\_properties\(mol1\)\]Molecule properties for ligand 2: \[calculate\_properties\(mol2\)\] ”

We outline a multi\-objective goal for the LLM, then provide both initial input ligands and the structure / property information using the functions described above\. At this initial step, the model chooses to either use the crossover tool to create a new combination, or just uses another tool to modify one of the given input ligands\. In either case, one intermediate ligand is produced\. We append the tool called and the result to the conversation history\.

Following this, we append the following intermediate prompt to the conversation history\.

Intermediate ToolMol PromptsOutput FINAL\_ANSWER if you have made sufficient modifications \(make at most 3\)\. Ensure that desired properties are maintained\.Current SMILES: \[CURR\_LIGAND\]Ligand structure and possible attachment points: \[get\_ligand\_structure\(mol\)\]Molecule properties: \[calculate\_properties\(mol\)\]

The model can choose to add additional mutations, and the intermediate prompt is appended to the conversation history after every modification step\.

## Appendix CAdditional Ablations

We provide 3 additional ablations regarding the setup of the genetic algorithm in ToolMol\.

#### Population & Offspring Size

We explore using a larger population size of 120 & and larger offspring size of 70, as well a smaller population size of 12 & offspring size of 7\. This is in contrast to the population size of 60 & offspring size of 35 used in the ToolMol setup for the main paper\.

#### Pareto Sampling

We also explore an alternate sampling method of choosing parent molecules for crossover and mutation\. Instead of sampling proportional to an exponentiated weighted scalar, we consider an approach based on Pareto ordering\. Given all moleculesℳc\\mathcal\{M\}\_\{c\}in the current population, we can define multiple Pareto frontiers; letP1P\_\{1\}be the set containing the non\-dominated frontier onℳc\\mathcal\{M\}\_\{c\}\. ThenP2P\_\{2\}is the set containing the non\-dominated Pareto frontier onℳc∖P1\\mathcal\{M\}\_\{c\}\\setminus P\_\{1\}, i\.e\. the next non\-dominated frontier obtained after removing all molecules from the true non\-dominated frontier from consideration\. ThenP3P\_\{3\}can be defined similarly as the set containing the non\-dominated Pareto frontier onℳc∖\(P1∪P2\)\\mathcal\{M\}\_\{c\}\\setminus\(P\_\{1\}\\cup P\_\{2\}\)\. For this alternate sampling method, we first select the top 3 Pareto frontiers \(P1P\_\{1\},P2P\_\{2\},P3P\_\{3\}\) to proceed to the next generation population after a round of offspring\. Then, sampling is determined by each molecule’s ”rank” in the Pareto ordering\. Formally, for a given population of sizenn, the probability of a particular moleculemjm\_\{j\}to be selected for crossover / mutation isP\(mj\)=g\(xj\)∑ig\(xi\),i∈\{1,…,n\}P\(m\_\{j\}\)=\\frac\{g\(x\_\{j\}\)\}\{\\sum\_\{i\}g\(x\_\{i\}\)\},i\\in\\\{1,\.\.\.,n\\\}, wherexi=\{1ifmi∈P1,2ifmi∈P2,3ifmi∈P3x\_\{i\}=\\\{1\\text\{ if \}m\_\{i\}\\in P\_\{1\}\\text\{, \}2\\text\{ if \}m\_\{i\}\\in P\_\{2\}\\text\{, \}3\\text\{ if \}m\_\{i\}\\in P\_\{3\}\}, andg\(x\)=11\+xg\(x\)=\\frac\{1\}\{1\+x\}\. We consider this sampling method because it aligns well with the Pareto approach we take to multi\-objective optimization in the rest of the genetic algorithm\.

Table[4](https://arxiv.org/html/2605.12784#A3.T4)shows the results of the three aforementioned ablations, compared against the ToolMol setup shown in the main paper\. We run all ablations on 3 different seeded initial populations\.

Table 4:Additional Ablations on ToolMol GA: Population Size & Pareto\-rank SamplingTargetMetricToolMol\(12 / 7\)ToolMol\(120 / 70\)ToolMol\(Pareto Sampling\)ToolMolc\-METBinding Affinity \(↓\\downarrow\)−11\.14±0\.20\\mathbf\{\-11\.14\\pm 0\.20\}−10\.68±0\.12\-10\.68\\pm 0\.12−11\.07±0\.19¯\\underline\{\-11\.07\\pm 0\.19\}−11\.00±0\.09\-11\.00\\pm 0\.09Filtered Affinity \(↓\\downarrow\)−10\.22±0\.16\-10\.22\\pm 0\.16−10\.13±0\.09\-10\.13\\pm 0\.09−10\.27±0\.06¯\\underline\{\-10\.27\\pm 0\.06\}−10\.35±0\.17\\mathbf\{\-10\.35\\pm 0\.17\}Hypervolume \(↑\\uparrow\)0\.60±0\.02¯\\underline\{0\.60\\pm 0\.02\}0\.59±0\.010\.59\\pm 0\.010\.60±0\.01¯\\underline\{0\.60\\pm 0\.01\}0\.62±0\.01\\mathbf\{0\.62\\pm 0\.01\}BRD4Binding Affinity \(↓\\downarrow\)−10\.61±0\.33\-10\.61\\pm 0\.33−10\.79±0\.23¯\\underline\{\-10\.79\\pm 0\.23\}−10\.95±0\.14\\mathbf\{\-10\.95\\pm 0\.14\}−10\.64±0\.28\-10\.64\\pm 0\.28Filtered Affinity \(↓\\downarrow\)−9\.87±0\.32¯\\underline\{\-9\.87\\pm 0\.32\}−9\.87±0\.29¯\\underline\{\-9\.87\\pm 0\.29\}−9\.67±0\.19\-9\.67\\pm 0\.19−9\.91±0\.18\\mathbf\{\-9\.91\\pm 0\.18\}Hypervolume \(↑\\uparrow\)0\.59±0\.010\.59\\pm 0\.010\.60±0\.02\\mathbf\{0\.60\\pm 0\.02\}0\.59±0\.0040\.59\\pm 0\.0040\.60±0\.01\\mathbf\{0\.60\\pm 0\.01\}ACAA1Binding Affinity \(↓\\downarrow\)−9\.87±0\.18\\mathbf\{\-9\.87\\pm 0\.18\}−9\.54±0\.08\-9\.54\\pm 0\.08−9\.87±0\.19\\mathbf\{\-9\.87\\pm 0\.19\}−9\.70±0\.23\-9\.70\\pm 0\.23Filtered Affinity \(↓\\downarrow\)−8\.77±0\.24\-8\.77\\pm 0\.24−8\.86±0\.25\\mathbf\{\-8\.86\\pm 0\.25\}−8\.81±0\.32¯\\underline\{\-8\.81\\pm 0\.32\}−8\.78±0\.15\-8\.78\\pm 0\.15Hypervolume \(↑\\uparrow\)0\.53±0\.020\.53\\pm 0\.020\.54±0\.001\\mathbf\{0\.54\\pm 0\.001\}0\.54±0\.009\\mathbf\{0\.54\\pm 0\.009\}0\.54±0\.008\\mathbf\{0\.54\\pm 0\.008\}Avg\. Rank \(↓\\downarrow\)2\.672\.672\.562\.562\.00¯\\underline\{2\.00\}1\.89\\mathbf\{1\.89\}

We observe that the configuration of ToolMol in the main paper beats all aforementioned ablations on average across all targets\. Thus, we choose to report the values and setup of the rightmost column in comparison to other baselines in our main analysis, although it is likely that the Pareto sampling method is not significantly weaker, and even beats the exponential fitness setup on particular targets\.

We also briefly test an alternate value for the exponential constantkkused in parent sampling\. We usek=10k=10in the main paper, and test that againstk=ek=ehere for the c\-MET target\. Results are shown in Table[5](https://arxiv.org/html/2605.12784#A3.T5)\.

Table 5:Ablation on exponential constantkk:1010vseeTargetMetrick=ek=ek=10k=10c\-METBinding Affinity \(↓\\downarrow\)−11\.12±0\.18\\mathbf\{\-11\.12\\pm 0\.18\}−11\.00±0\.09\-11\.00\\pm 0\.09Filtered Affinity \(↓\\downarrow\)10\.32±0\.15\\\-10\.32\\pm 0\.15−10\.35±0\.17\\mathbf\{\-10\.35\\pm 0\.17\}Hypervolume \(↑\\uparrow\)0\.61±0\.010\.61\\pm 0\.010\.62±0\.01\\mathbf\{0\.62\\pm 0\.01\}

We observe that while there is very little difference between the 2 constants, usingk=10k=10marginally improves the multi\-objective metrics we care most about, and thus we choose to report that configuration in the main paper\.

## Appendix DAlphaEvolve / ShinkaEvolve for Drug Discovery

In this section, we outline the ShinkaEvolve\-inspired algorithm we built for small\-molecule drug discovery\. It is a non\-sophisticated MAP\-Elites approach with independent islands and random migration events\. We maintain 4 separate MAP\-Elites grids that are referred to as islands\. Each grid is actually 1\-dimensional, and stores molecule candidates within bins based on their molecular weight\. There are 50 bins evenly discretizing molecular weight within the range \[200, 900\]\. Any molecules outside of that range are placed into the outermost bins\.

The core LLM modification step occurs when we sample two molecules from a particular grid for crossover / mutation operations, resulting in one new molecule\. The sampling procedure closely follows ShinkaEvolve, where parent molecules are sampled based on a balance between fitness and how often that molecule has already been sampled for reproduction\. LetΦ\(m\)=∑ifi\(m\)\\Phi\(m\)=\\sum\_\{i\}f\_\{i\}\(m\)be the fitness of a molecule, wherefi\(m\)f\_\{i\}\(m\)is theiith objective scaled to\[0,1\]\[0,1\]\. Letα=median\(Φ\(m1\),…,Φ\(mn\)\)\\alpha=\\text\{median\}\(\\Phi\(m\_\{1\}\),\.\.\.,\\Phi\(m\_\{n\}\)\)for allmim\_\{i\}currently in the MAP\-Elites grid\. Then letsi=σ\(λ∗\(Φ\(mi\)−α\)\)s\_\{i\}=\\sigma\(\\lambda\*\(\\Phi\(m\_\{i\}\)\-\\alpha\)\), whereσ\\sigmais the sigmoid function andλ\\lambdais a constant that controls selection pressure\. We useλ=1\\lambda=1for our experiments\. Further, lethi=11\+N\(mi\)h\_\{i\}=\\frac\{1\}\{1\+N\(m\_\{i\}\)\}, whereN\(mi\)N\(m\_\{i\}\)counts the number of timesmim\_\{i\}has already been chosen for reproduction\. Thus for each moleculemim\_\{i\}, we havesis\_\{i\}which benefits molecules with high fitness, andhih\_\{i\}which benefits molecules that have not been chosen frequently\. The final probability distribution is constructed byP\(mi\)=wi∑jwj,j∈\{1,…,n\}P\(m\_\{i\}\)=\\frac\{w\_\{i\}\}\{\\sum\_\{j\}w\_\{j\}\},j\\in\\\{1,\.\.\.,n\\\}, wherewi=si∗hiw\_\{i\}=s\_\{i\}\*h\_\{i\}\. After 2 molecules are selected according to this sampling formulation, they undergo LLM crossover / mutation steps either in a similar manner to MOLLEO, or with the ToolMol toolbox\.

When a new generated molecule is trying to get placed into the grid, if the bin corresponding to the new molecule is not filled, the new molecule is immediately placed into that bin\. If it is occupied, the new molecule replaces the current molecule in the bin only if it has a higher fitness, calculated by theΦ\(m\)\\Phi\(m\)formula described above\.

On initialization of the algorithm, we sample 40 molecules from ZINC 250K, and place them uniformly at random across the 4 islands\. Then, each island undergoes 10 independent molecule generations\. After all islands complete their generations, a migration event occurs; we sample 2 molecules from each island uniformly at random, then send a copy of those molecules to another island, also chosen uniformly\. Whether or not those migrated molecules are accepted into the island depends on the bin they land into and the fitness competition described above\. Following ShinkaEvolve, we do not allow the absolute highest fitness molecule from each island to migrate, aiming to preserve some level of diversity between the islands\. After 1000 total binding affinity oracle evaluations, the algorithm terminates, and all generated molecules \(including ones discarded due to losing to fitness competition\) are returned for downstream evaluation\.

We do not implement the novelty rejection\-sampling or the LLM ensemble described in ShinkaEvolve for our simplistic implementation\. We note that this algorithm is still largely unexplored for drug discovery problems, and anticipate that there are likely significant gains to be made beyond our simplistic implementation that was designed primarily as a baseline\. We plan to explore further variations of this algorithm for this multi\-objective problem in future work\.

## Appendix EAdditional ABFE Information

### E\.1ABFE Setup

For our ABFE calculations, we utilize the following Binding Affinity ToolBAT\.py\(Heinzelmann & Gilson,[2021](https://arxiv.org/html/2605.12784#bib.bib16)\)repository:[https://github\.com/GHeinzelmann/BAT\.py](https://github.com/GHeinzelmann/BAT.py)\. We simulate using OpenMM and the standard SDR method\. We use the Boltz\-2 predicted ligand pose as the starting pose for the simulation\. Because Boltz\-2 does not take a protein crystal structure as input and makes a prediction based on the given amino acid sequence, we first align the entire predicted Boltz\-2 conformation to the protein crystal structure with ChimeraX\(Pettersen et al\.,[2020](https://arxiv.org/html/2605.12784#bib.bib36)\), then extract only the ligand pose for ABFE\. We observe this alignment to yield an RMSE of under 0\.7 angstroms on average; thus we are comfortable using the aligned ligand pose with the crystal structure in ABFE calculations\. We do not observe frequent steric clashes resulting from this process\.

Our simulation steps parameters for the BAT\.py framework are as follows: eq\_steps1 = 500000 \(Number of steps for equilibration gradual release\) eq\_steps2 = 15000000 \(Number of steps for equilibration after release\)

m\_steps1 = 500000 \(Number of steps per window for component m \(equilibrium\)\) m\_steps2 = 1000000 \(Number of steps per window for component m \(production\)\)

n\_steps1 = 500000 \(Number of steps per window for component n \(equilibrium\)\) n\_steps2 = 1000000 \(Number of steps per window for component n \(production\)\)

e\_steps1 = 250000 \(Number of steps per window for component e \(equilibrium\)\) e\_steps2 = 500000 \(Number of steps per window for component e \(production\)\)

v\_steps1 = 500000 \(Number of steps per window for component v \(equilibrium\)\) v\_steps2 = 1000000 \(Number of steps per window for component v \(production\)\)

On 8 NVIDIA RTX 4090 GPUs, one ABFE calculation typically takes around 12 hours to complete\.

### E\.2Correlation Analysis: ABFE vs Boltz\-2 vs AutoDock

To justify our usage of Boltz\-2 as a primary binding affinity oracle, we provide general analysis of the correlation between Boltz\-2, AutoDock, and the gold\-standard ABFE\. In Figure[5](https://arxiv.org/html/2605.12784#A5.F5), we take 32 compounds for c\-MET, 16 of which are known binders, and 16 of which are presumed inactive binders\. We calculate the ABFE, Boltz\-2, and AutoDock binding affinities for all 32 compounds\. We exclude results for any failed AutoDock or Boltz\-2 runs\.

![Refer to caption](https://arxiv.org/html/2605.12784v1/figures/abfe_vs_docking.png)\(a\)ABFE vs AutoDock scores
![Refer to caption](https://arxiv.org/html/2605.12784v1/figures/abfe_vs_boltz.png)\(b\)ABFE vs Boltz\-2 scores

Figure 5:Comparison of correlation between AutoDock & ABFE and Boltz\-2 & ABFE for 32 known compounds for the c\-MET protein target\. We observe a significantly higher correlation between Boltz\-2 and ABFE as compared to AutoDock\.We see that ABFE and AutoDock docking showr2=0\.09r^\{2\}=0\.09among the 32 compounds, while ABFE and Boltz\-2 showr2=0\.42r^\{2\}=0\.42\. As an oracle nearly 1000x less computationally expensive than ABFE, Boltz\-2 shows exceptional correlation with ABFE, especially in comparison to docking\. Furthermore, we calculate the ROC\-AUC score for Boltz\-2 and docking, to see how well they can separate binders from non\-binders\. Boltz\-2 scores 0\.95 for this metric, while AutoDock scores 0\.84\. Due to computational and time constraints regarding expensive ABFE calculations, we are only able to provide results for the c\-MET target at this time\.

We demonstrate that Boltz\-2 has stronger correlation with the most accurate gold\-standard computational methods for one of our primary binding targets, which motivates us to employ Boltz\-2 as a binding affinity oracle over the current industry\-standard AutoDock, which itself has often been noted for its practical inaccuracy\. We generally observe Boltz\-2 to be approximately a factor of 10 more expensive to run than AutoDock; however, this difference is entirely negligible in comparison to the cost of molecular dynamics methods such as ABFE\.
ToolMol: Evolutionary Agentic Framework for Multi-objective Drug Discovery

Similar Articles

Probe Before You Edit: Probing-Guided Molecular Optimization for LLM Agents in Structure-Based Drug Design

Agents on a Tree: Pathwise Coordination for Multi-Objective Molecular Optimization

Gene Expression-Informed Jointly Controlled Generative Modeling for Precision Molecular Design

Generating Developable 3D Molecules via Pocket-Conditioned Diffusion and Property-Aware Optimization

LLM-Driven Evolutionary Generation of Multi-Objective Bayesian Optimization Algorithms

Submit Feedback

Similar Articles

Probe Before You Edit: Probing-Guided Molecular Optimization for LLM Agents in Structure-Based Drug Design
Agents on a Tree: Pathwise Coordination for Multi-Objective Molecular Optimization
Gene Expression-Informed Jointly Controlled Generative Modeling for Precision Molecular Design
Generating Developable 3D Molecules via Pocket-Conditioned Diffusion and Property-Aware Optimization
LLM-Driven Evolutionary Generation of Multi-Objective Bayesian Optimization Algorithms