DrugSAGE: Self-evolving Agent Experience for Efficient State-of-the-Art Drug Discovery
Summary
DrugSAGE is a framework that accumulates and reuses cross-task memory to build state-of-the-art drug discovery models efficiently, outperforming baseline agents by 10-30% on held-out tasks.
# DrugSAGE: Self-evolving Agent Experience for Efficient State-of-the-Art Drug Discovery
Source: [https://arxiv.org/html/2605.15461](https://arxiv.org/html/2605.15461)
Yikun Zhang (1,2), Xiwei Cheng (1,2), Tianyu Liu (3), Yuanqi Du (4), Wengong Jin (1,2)

(1) Northeastern University, (2) Broad Institute of MIT and Harvard, (3) Yale University, (4) Microsoft Research New England
###### Abstract
Building state-of-the-art (SOTA) predictive models for drug discovery requires expensive search over tools, architectures, and training strategies. Current LLM-based agents can find SOTA solutions through extensive trial and error, but they do not retain the experience accumulated along the way and therefore pay the full search cost on every new task. We propose DrugSAGE (Self-evolving Agent Experience), a framework that accumulates and reuses experience across tasks to build SOTA drug discovery models efficiently. DrugSAGE maintains a cross-task memory of verified skills, statistical evidence about effective strategies, and a record of recurring errors and their fixes. In some cases, DrugSAGE transfers a working solution directly without test-time search. On 33 molecular property prediction tasks, DrugSAGE ranks first among nine SOTA agents in a single-task setting. With memory accumulated from 16 smaller tasks, DrugSAGE achieves an average normalized score of 0.935 on 17 held-out tasks in a cross-task evaluation setting and outperforms all baseline agents by 10-30% in a zero-test-time search regime. In summary, our work shows the advantage of cross-task memory for efficient SOTA model development in drug discovery.
## 1 Introduction
Autonomous agents powered by large language models are increasingly capable of performing scientific research tasks end-to-end \[[39](https://arxiv.org/html/2605.15461#bib.bib45)\]. In biomedicine and drug discovery, systems such as Biomni \[[16](https://arxiv.org/html/2605.15461#bib.bib34)\] and STELLA \[[18](https://arxiv.org/html/2605.15461#bib.bib35)\] have demonstrated that LLM-based agents can automatically build workflows for data processing, feature extraction, and predictive modeling. More recent efforts such as AutoResearch \[[19](https://arxiv.org/html/2605.15461#bib.bib19)\] go one step further, moving from assembling a functional pipeline to actively searching over model architectures, training recipes, and preprocessing strategies to achieve state-of-the-art (SOTA) performance on a given benchmark. Achieving SOTA accuracy is especially critical for drug discovery because inaccurate predictions translate directly into experimental failures, which cost thousands of dollars and months of effort.
However, building SOTA models is an expensive search problem \[[22](https://arxiv.org/html/2605.15461#bib.bib46)\]. The agent must identify relevant tools and codebases from the literature, adapt them to the dataset at hand, tune training configurations and hyperparameters, and iterate through rounds of trial and error, all of which consume substantial compute and token budgets. This cost grows further when datasets become large, foundation models are expensive to fine-tune, and a single training run takes hours or days \[[3](https://arxiv.org/html/2605.15461#bib.bib48)\]. Slow error feedback makes the iterative search loop on which current agents rely increasingly impractical. Critically, most existing agents pay this full cost independently on every new task, discarding all search experience before the next one begins. As a result, their ability to achieve SOTA performance has been difficult to scale across the thousands of diverse prediction problems in drug discovery.
In this work, we propose DrugSAGE (Self-evolving AGent Experience), a framework that accumulates and reuses experience across tasks to build SOTA solutions efficiently. The key observation behind DrugSAGE is that many drug discovery tasks share structural similarities overlooked by current agents. Predicting solubility, binding affinity, and bioactivity all involve the same input/output structure and differ only in their labels, dataset sizes, and evaluation metrics. When an agent discovers that a particular molecular featurization performs well or that a specific training recipe reliably improves a class of models, there is no reason to discard that knowledge. Therefore, DrugSAGE treats each task not as an isolated problem, but as an experience that enriches the agent for future tasks. It maintains a cross-task memory that accumulates verified skills, statistical evidence about which strategies generally work better, and a record of recurring failure modes and their fixes. As this memory grows, the agent's search narrows: on a new task, instead of exploring the full space of tools, architectures, and hyperparameters from scratch, DrugSAGE draws on prior experience to prioritize approaches that are likely to succeed, and in some cases transfers a working solution directly with no additional search at all.
We evaluate DrugSAGE on two benchmarks \[[15](https://arxiv.org/html/2605.15461#bib.bib1),[28](https://arxiv.org/html/2605.15461#bib.bib47)\] with 33 drug-property prediction tasks spanning absorption, distribution, metabolism, excretion, toxicity, binding, solubility, lipophilicity, and bioactivity. In a single-task setting, DrugSAGE ranks first against eight baseline agents, including autoresearch systems, ML automation agents, and scientific discovery agents. With cross-task memory accumulated from 16 smaller tasks, DrugSAGE achieves an average normalized score of 0.935 on 17 held-out tasks in a cross-task setting and outperforms all baseline agents by 10-30% in a zero-test-time search regime. In summary, this work makes three key contributions.
- We introduce DrugSAGE, an autonomous agent that maintains a cross-task memory whose components are each integrated into a different stage of an MCTS-based search loop, with a formal guarantee that the memory-augmented selection policy preserves the regret bound of standard UCB.
- Unlike previous works such as Autoresearch \[[19](https://arxiv.org/html/2605.15461#bib.bib19)\] that refine user-provided or top leaderboard models, DrugSAGE automatically builds a skill library by searching the literature and GitHub repositories. This broadens the search space and removes the need for a human-curated starting point.
- We show that cross-task memory enables a zero-test-time search regime: DrugSAGE-Zero transfers verified solutions to new tasks without test-time search. It outperforms all baseline agents by 10-30% even though the baselines perform 20 search iterations at test time.
## 2 Related Work
Automated algorithm discovery agents. Recent advances in LLMs have enabled agents to automate model development in machine learning, framed as a search problem over benchmark datasets. Early attempts address this problem with retrieval-augmented generation from existing open-source models \[[13](https://arxiv.org/html/2605.15461#bib.bib5),[26](https://arxiv.org/html/2605.15461#bib.bib6)\]. Later work instead tackles it with search algorithms \[[17](https://arxiv.org/html/2605.15461#bib.bib7),[35](https://arxiv.org/html/2605.15461#bib.bib8)\]. More recently, a variety of work improves on previous methods along several dimensions, including introducing agent structures \[[41](https://arxiv.org/html/2605.15461#bib.bib9),[21](https://arxiv.org/html/2605.15461#bib.bib10)\], incorporating better search algorithms \[[5](https://arxiv.org/html/2605.15461#bib.bib13),[11](https://arxiv.org/html/2605.15461#bib.bib12)\], and using external knowledge \[[25](https://arxiv.org/html/2605.15461#bib.bib11)\]. DrugSAGE introduces the accumulation of structured cross-task experience that grows as the agent solves successive tasks, enabling later tasks to directly retrieve verified solutions at zero search cost or to warm-start from empirically validated starting points.
Agent experience. A growing body of work equips LLM agents with persistent, evolving memory; these approaches differ primarily in what is stored and how actionable it is. Episodic memory aids immediate retries with trial-level records \[[31](https://arxiv.org/html/2605.15461#bib.bib21)\], while other approaches distill non-executable insights into semantic memory \[[42](https://arxiv.org/html/2605.15461#bib.bib22),[6](https://arxiv.org/html/2605.15461#bib.bib23),[33](https://arxiv.org/html/2605.15461#bib.bib27)\]. For actionable guidance, systems develop procedural memory by maintaining append-only skill libraries \[[36](https://arxiv.org/html/2605.15461#bib.bib20)\], inducing reusable workflows \[[38](https://arxiv.org/html/2605.15461#bib.bib25)\], managing full memory lifecycles \[[10](https://arxiv.org/html/2605.15461#bib.bib24)\], or evolving shared knowledge via multi-agent reflection \[[29](https://arxiv.org/html/2605.15461#bib.bib18)\]. More recent work explores the accumulation of agent experience across tasks \[[43](https://arxiv.org/html/2605.15461#bib.bib30),[34](https://arxiv.org/html/2605.15461#bib.bib31),[40](https://arxiv.org/html/2605.15461#bib.bib32)\]. DrugSAGE combines two complementary memory mechanisms, pairing an executable skill library that expands across tasks with a performance-grounded memory that deepens through execution, enabling broader search coverage and more targeted reuse over time.
Scientific discovery agents. Recent work has begun to instantiate LLMs as scientific discovery agents, building on their growing capacity to reason over code, tools, and scientific contexts. One line of work embeds LLMs in a verifier-driven hypothesis search loop. FunSearch \[[30](https://arxiv.org/html/2605.15461#bib.bib14)\] uses LLMs as evolutionary operators in a program evolution loop, while many later works extend this capability beyond program discovery \[[37](https://arxiv.org/html/2605.15461#bib.bib3),[32](https://arxiv.org/html/2605.15461#bib.bib2),[27](https://arxiv.org/html/2605.15461#bib.bib15)\]. Beyond hypothesis search, an alternative perspective is to build agents that orchestrate tools, such as Coscientist \[[4](https://arxiv.org/html/2605.15461#bib.bib33)\] and ChemCrow \[[23](https://arxiv.org/html/2605.15461#bib.bib37)\]. Later work \[[16](https://arxiv.org/html/2605.15461#bib.bib34),[18](https://arxiv.org/html/2605.15461#bib.bib35)\] has extended agent capabilities to building computational workflows. More recently, SAGA \[[8](https://arxiv.org/html/2605.15461#bib.bib4)\] has taken a step forward by automatically evolving the objectives in the scientific discovery workflow. The most closely related work is Agentomics \[[24](https://arxiv.org/html/2605.15461#bib.bib39)\], which builds an end-to-end experimentation agent that explores ML modeling strategies for a given biomedical dataset. In contrast, DrugSAGE formulates this setting as a cross-task search problem, reusing empirically validated strategies from prior tasks to seed experience-conditioned MCTS and avoid redundant exploration.
## 3 Methodology
Problem formulation. We first define the terminology used in this paper.
- A *target task* is $\tau = (\mathcal{D}_{\text{train}}, \mathcal{D}_{\text{val}}, \mathcal{D}_{\text{test}}, \mu, B)$, where $\mathcal{D}_{*}$ denotes the training, validation, and test data; $\mu$ is the task metric, such as AUROC, AUPRC, RMSE, or MAE; and $B$ is the *search budget*, i.e., the number of new candidate solutions the agent may train and validate during iterative search.
- The *skill library* $\mathcal{K}$ is a collection of validated, task-relevant executable skills, including model families, dataset-specific methods, and training strategies such as featurization, preprocessing, class-imbalance losses, sampling, and tuning.
- An *executable solution* is a runnable model built from the skills in $\mathcal{K}$, possibly with typed edits. It specifies the entire pipeline (e.g., model architecture, featurization, training procedure) and is included only if it can be executed in a sandbox. The *best solution* is the solution that receives the best score on the *validation* set according to the given task metric $\mu$.
- The *search tree* contains all executable solutions for a given task. Each *node* in the search tree is a particular instance of a model family. Each *edge* contains the *typed edits* applied to its parent node, including changes in model architecture, training objective, data processing, and ensemble strategies.
Overview. The goal of DrugSAGE is to find the best solution within the search budget $B$. As shown in [Figure 1](https://arxiv.org/html/2605.15461#S3.F1), DrugSAGE first constructs an executable skill library $\mathcal{K}$ (§[3.1](https://arxiv.org/html/2605.15461#S3.SS1)) by searching the literature. It then maintains an experience memory $\mathcal{Z}$ across tasks (§[3.2](https://arxiv.org/html/2605.15461#S3.SS2)). When $B > 0$, DrugSAGE runs a memory-enhanced Monte Carlo tree search (MCTS) algorithm to search for executable solutions, where each step is equipped with one of the memory components (§[3.3](https://arxiv.org/html/2605.15461#S3.SS3)). When $B = 0$, DrugSAGE performs memory routing to transfer a verified solution directly without additional search (§[3.4](https://arxiv.org/html/2605.15461#S3.SS4)). We call this setting DrugSAGE-Zero.
Figure 1: DrugSAGE overview. The Explore Agent builds a shared skill library $\mathcal{K}$ from literature and repositories. Cross-task memory $\mathcal{Z}$ enables two settings: experience-conditioned MCTS ($B > 0$), where $\mathcal{Z}$ guides and is updated by each search step, and zero-test-time routing ($B = 0$), where a verified solution is retrieved from $\mathcal{Z}$ without any new search.

### 3.1 From Literature to Executable Skills
A scientific agent needs an open search space, but not an unconstrained one. DrugSAGE treats the literature and GitHub repositories as a source of executable search space. Before budgeted search begins, the Explore Agent adds new methods to the shared skill library $\mathcal{K}$. It follows a discovery, grounding, and validation procedure. The discovery stage expands the task into multiple expert-perspective queries, searches the literature, selects relevant papers with usable repositories, and writes a task memory of candidate methods. The grounding stage clones the selected repositories, extracts an abstract-syntax-tree (AST) API snapshot of public interfaces, model classes, and usage examples, and asks the LLM to write a grounded SKILL.md using only symbols observed in that snapshot. The validation stage runs tiered checks, from syntax and import resolution to end-to-end execution in a per-skill sandbox, with automatic repair on failure. Only validated skills are included in $\mathcal{K}$. In this way, the search space can grow with the field while remaining executable. Details of the Explore Agent are provided in Appendix [H](https://arxiv.org/html/2605.15461#A8).
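The grounding stage's AST snapshot can be illustrated with Python's standard `ast` module. This is a minimal sketch, not the Explore Agent's implementation: the function name `api_snapshot`, the snapshot layout, and the rule that "public" means "no leading underscore" are all assumptions made for illustration.

```python
import ast

def api_snapshot(source: str) -> dict:
    """Collect public top-level classes (with public methods) and functions.

    Hypothetical helper: "public" is assumed to mean "name has no leading underscore".
    """
    tree = ast.parse(source)
    snapshot = {"classes": {}, "functions": []}
    func_types = (ast.FunctionDef, ast.AsyncFunctionDef)
    for node in tree.body:
        if node.__class__ is ast.ClassDef and not node.name.startswith("_"):
            # Keep only public methods defined directly in the class body.
            snapshot["classes"][node.name] = [
                item.name
                for item in node.body
                if isinstance(item, func_types) and not item.name.startswith("_")
            ]
        elif isinstance(node, func_types) and not node.name.startswith("_"):
            # Record the function name together with its positional argument names.
            snapshot["functions"].append((node.name, [a.arg for a in node.args.args]))
    return snapshot

# Toy repository source, invented for this example.
EXAMPLE_SOURCE = '''
class GNNModel:
    def __init__(self, hidden):
        self.hidden = hidden
    def fit(self, smiles, labels):
        pass
    def _private(self):
        pass

def featurize(smiles, radius=2):
    return []

def _helper():
    pass
'''

snap = api_snapshot(EXAMPLE_SOURCE)
print(snap)
```

Restricting a generated skill description to the names in such a snapshot is one plausible way to keep the LLM from referencing symbols that do not exist in the cloned repository.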
### 3.2 Experience Memory
The agent maintains a persistent memory $\mathcal{Z} = (\mathcal{R}, \mathcal{H}, \mathcal{Q})$ across tasks. These memories provide reusable cross-task evidence for the search tree.
The solution memory $\mathcal{R}$ records the performance of different models on different tasks. Each record in this memory describes a node in the search tree, including its model family, model architecture, hyperparameters, training procedure, data processing, training objectives, task description, and performance on the validation set. $\mathcal{R}$ is updated whenever a new solution is added to the search tree. This memory helps DrugSAGE identify the most promising solutions to explore.
The refinement memory $\mathcal{H}$ stores the information associated with each edge in the search tree. Each record contains the description of the parent and child nodes; the task description; the LLM-proposed edits (e.g., changes to model architecture, training objective, or data processing); the rationale (LLM-generated reasons why the proposed edits are beneficial); and the validation performance difference between the parent and child. $\mathcal{H}$ is updated when a child node spawns from a parent with LLM-generated edits. This memory helps DrugSAGE identify promising optimization strategies, e.g., which changes in model architecture, training pipeline, and data processing generally work better than others.
The execution memory $\mathcal{Q}$ stores all the logs generated by the sandbox during execution, including failure information, verified fixes, resource profiles, environment traces, and sandbox logs. When the execution of a candidate solution fails, the sandbox matches the error against known failure signatures and applies a verified fix immediately if one exists. Resource profiles keep track of the average runtime and memory usage of each model family, which is useful for preventing out-of-memory or timeout failures. After each execution, new failure information, fixes, and resource profiles are appended to $\mathcal{Q}$. This memory enables DrugSAGE to apply quick fixes when it encounters execution failures.
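The failure-signature lookup described for the execution memory can be sketched as follows. The paper does not specify how signatures are represented, so this example assumes they are regular expressions matched against the error log; the class `ExecutionMemory`, its method names, and the two example signatures and fixes are hypothetical.

```python
import re

class ExecutionMemory:
    """Toy stand-in for the execution memory: failure signatures mapped to verified fixes."""

    def __init__(self):
        self._fixes = []  # list of (compiled regex signature, fix description)

    def record_fix(self, signature: str, fix: str) -> None:
        # Assumption: a signature is a regex that identifies a recurring error.
        self._fixes.append((re.compile(signature), fix))

    def find_fix(self, error_log: str):
        # Return the first verified fix whose signature matches, else None.
        for pattern, fix in self._fixes:
            if pattern.search(error_log):
                return fix
        return None

memory = ExecutionMemory()
memory.record_fix(r"CUDA out of memory", "halve batch size")
memory.record_fix(r"No module named '\w+'", "install missing package")

log = "RuntimeError: CUDA out of memory. Tried to allocate 2.00 GiB"
print(memory.find_fix(log))
```

When `find_fix` returns `None`, the caller would fall back to slower repair paths; the lookup only short-circuits errors that have been seen and fixed before.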
### 3.3 Experience-Memory-Enhanced MCTS
Figure 2: Experience Memory. DrugSAGE stores cross-task experience as $\mathcal{Z} = (\mathcal{R}, \mathcal{H}, \mathcal{Q})$. Solution memory $\mathcal{R}$ stores family priors and verified best solutions for root-family and parent-solution selection. Refinement memory $\mathcal{H}$ stores proposal rationales and typed edit effects for grounding child-solution proposals. Execution memory $\mathcal{Q}$ stores script errors, environment failures, verified fixes, and resource profiles for sandbox repair and resource-aware execution. After each execution, outcomes, edits, rationales, logs, fixes, and resource usage are written back to memory, converting prior train/eval runs into reusable cross-task evidence.

DrugSAGE searches over a task-level solution forest. Each root is a baseline executable solution obtained by applying a skill from $\mathcal{K}$ to the target task. Each edge is a typed edit or composition operation, and each node is a concrete executable solution with execution status, metric record, lineage, and artifacts. The search loop alternates between four stages: screening roots, refining completed parents, proposing executable edits, and executing or repairing the resulting candidates. Experience memory $\mathcal{Z} = (\mathcal{R}, \mathcal{H}, \mathcal{Q})$ enters this loop through four distinct interfaces, illustrated in [Figure 2](https://arxiv.org/html/2605.15461#S3.F2).
Step 1: selecting model families using the solution memory $\mathcal{R}$. The screening step allocates the budget to different model families based on their performance on historical tasks. Specifically, DrugSAGE selects the next eligible family $f$ using a memory-augmented UCB rule:

$$f_t = \arg\max_{f \in \mathcal{F}_{\text{eligible}}} \underbrace{\mathrm{exploit}_t(f)}_{\text{target task}} + \underbrace{\alpha \sqrt{\tfrac{\ln(t+1)}{n_t(f)}}}_{\text{exploration}} + \underbrace{\mathrm{transfer}(f)}_{\text{historical tasks}}, \tag{1}$$

where $t$ indexes the screening step; $n_t(f)$ ($\geq 1$) counts the number of visits of the model family $f$; $\mathrm{exploit}_t(f) \in [0,1]$ records the best performance of any node in the model family on the target task, penalized for instability and overfitting; and $\mathrm{transfer}(f)$ computes the weighted average performance of nodes in family $f$ over historical tasks, using the records retrieved from the solution memory. We formally define $\mathrm{transfer}(f)$ in Appendix [I](https://arxiv.org/html/2605.15461#A9). Intuitively, when $t = 0$, the agent tends to choose the model family with the best performance on historical tasks. As the agent explores more solutions on the target task, it relies more on performance on the current task ($\mathrm{exploit}_t(f)$) than on performance on the historical tasks ($\mathrm{transfer}(f)$). Moreover, we formally show that introducing this bias term does not change the regret bound of standard UCB:
###### Theorem 3\.1\.
Assuming bounded historical performance $|\mathrm{transfer}(f)| \leq C_0$ and standard asymptotic convergence of the exploit estimate, the policy in [Equation 1](https://arxiv.org/html/2605.15461#S3.E1) attains cumulative regret

$$R_T \;\leq\; \sum_{f \in \mathcal{F}:\, \Delta_f > 0} \left( \frac{8(\alpha^2 + C_0^2)\ln T}{\Delta_f} + \left(1 + \tfrac{\pi^2}{3}\right)\Delta_f \right), \tag{2}$$

matching the $O(|\mathcal{F}| \log T)$ order of standard UCB.

Proof: The argument uses the boundedness and asymptotic consistency of $\mathrm{transfer}(f)$ and follows the proof steps of UCB1 together with the AM-GM inequality. Details are in Appendix [J.2.3](https://arxiv.org/html/2605.15461#A10.SS2.SSS3).
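The selection rule in Equation 1 can be sketched in a few lines. This is an illustrative reading of the formula, not the authors' code: `select_family` and its dictionary arguments are hypothetical, and `transfer` is treated as a fixed per-family number because its formal definition lives in Appendix I, which is not reproduced here.

```python
import math

def select_family(exploit, visits, transfer, t, alpha=1.0):
    """Pick the family maximizing exploit + exploration bonus + transfer (Equation 1).

    exploit:  family -> best penalized score on the target task, in [0, 1]
    visits:   family -> visit count n_t(f), assumed >= 1
    transfer: family -> historical-task prior (assumed constant here)
    """
    def score(family):
        bonus = alpha * math.sqrt(math.log(t + 1) / visits[family])
        return exploit[family] + bonus + transfer.get(family, 0.0)
    return max(exploit, key=score)

# At t = 0 the exploration bonus is zero (ln 1 = 0), so with no target-task
# evidence the historical prior alone decides the first pick.
exploit = {"gnn": 0.0, "gbdt": 0.0}
visits = {"gnn": 1, "gbdt": 1}
transfer = {"gnn": 0.3, "gbdt": 0.6}
first_pick = select_family(exploit, visits, transfer, t=0)
print(first_pick)
```

Once target-task scores arrive, a large enough gap in `exploit` overrides the prior, which mirrors the intuition stated after Equation 1.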
Step 2: sampling parent solutions using the solution memory $\mathcal{R}$. Once the roots are selected, the decision shifts from family selection to choosing a specific solution in the model family to expand. Let $\mathcal{V}$ be a pool of completed, non-ensemble nodes with finite primary metrics. For each $v_i \in \mathcal{V}$, let $q_i$ denote its performance on the target task and $c_i = |\mathrm{children}(v_i)|$ its expansion count. The parent sampler assigns the following weight to each candidate solution:

$$w_i = \underbrace{\sigma\!\left(\beta\,\frac{q_i - \mathrm{median}(q)}{\max(\mathrm{MAD}(q), \epsilon)}\right)}_{\text{target task performance}} \cdot \underbrace{\frac{1}{1 + c_i}}_{\text{breadth}} \cdot \underbrace{\bigl(1 + \lambda \cdot \mathrm{transfer}(v_i)\bigr)}_{\text{historical task performance}}, \quad p(v_i) = \frac{w_i}{\sum_j w_j}, \tag{3}$$

where $\sigma(x) = (1 + e^{-x})^{-1}$, $\mathrm{MAD}(q)$ is the median absolute deviation across $\mathcal{V}$, $\epsilon > 0$ guards against a vanishing denominator, and $\mathrm{transfer}(v_i)$ is the node-level analog of $\mathrm{transfer}(f)$ (Appendix [I](https://arxiv.org/html/2605.15461#A9)). The three factors balance target task performance, expansion breadth (discouraging overly expanded parents), and historical performance on related tasks.
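The parent-sampling weights in Equation 3 can be computed directly. The sketch below is an assumption-laden illustration: the function names, the default values of `beta`, `lam`, and `eps`, and the use of a numerically stable sigmoid are choices made here, and `transfer` values are supplied as plain numbers since their definition is deferred to Appendix I.

```python
import math
import statistics

def stable_sigmoid(x):
    """Logistic function written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def parent_probabilities(q, child_counts, transfer, beta=1.0, lam=0.5, eps=1e-8):
    """Sampling probabilities p(v_i) from Equation 3.

    q:            target-task scores q_i (assumed higher is better)
    child_counts: expansion counts c_i
    transfer:     node-level historical priors (assumed non-negative numbers)
    """
    med = statistics.median(q)
    mad = statistics.median([abs(x - med) for x in q])
    scale = max(mad, eps)  # eps guards against a vanishing denominator
    weights = [
        stable_sigmoid(beta * (qi - med) / scale) / (1 + ci) * (1 + lam * ti)
        for qi, ci, ti in zip(q, child_counts, transfer)
    ]
    total = sum(weights)
    return [w / total for w in weights]

# Three unexpanded parents with no historical prior: better scores get more mass.
p = parent_probabilities([0.70, 0.80, 0.90], [0, 0, 0], [0.0, 0.0, 0.0])
print(p)
```

With equal scores, the breadth factor alone decides: a parent that already has three children receives a quarter of the weight of an unexpanded one.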
Step 3: generating child solutions using the refinement memory $\mathcal{H}$. After a parent is sampled, DrugSAGE asks the LLM to propose a typed edit. The prompt is grounded by $\mathcal{H}$, which retrieves relevant rationales, previous attempts, successful recipes, and recent trajectory summaries for the current task and parent lineage. The generated edit must parse into the expected schema and pass duplicate and feasibility checks. If the proposal is empty, malformed, or nearly a duplicate of a previous child, DrugSAGE resamples a parent or retries proposal generation. After execution, the proposed edit, rationale, and observed outcome are written back into $\mathcal{H}$.
Step 4: constructing environments to execute solutions using the execution memory $\mathcal{Q}$. A proposed child is materialized as an executable program and evaluated in a sandbox. Before execution, DrugSAGE automatically prepares a skill-specific runtime: it resolves dependencies, creates or reuses an isolated environment, installs missing packages when needed, verifies imports with smoke tests, and sets timeout and resource limits using profiles from $\mathcal{Q}$. If setup or execution fails, the repair loop queries $\mathcal{Q}$ for a verified fix matching the observed error; matched fixes are applied immediately and the run is retried without additional LLM debugging. Otherwise, DrugSAGE escalates to dependency repair, script patching, program regeneration, or environment rebuild. Successful fixes are marked verified; the final status, metrics, logs, resource trace, and repair outcome are written back to $\mathcal{Q}$.
Step 5: updating memories. Every executed candidate produces a result bundle that updates both the task-level forest and the persistent memory. $\mathcal{R}$ receives the score, status, and description of the generated solutions. $\mathcal{H}$ receives the proposal rationale and trace of each LLM-guided child expansion. $\mathcal{Q}$ receives logs recording the failures, repairs, and resource profiles of all executed experiments. In this way, the same execution advances the current task and also becomes reusable evidence for later tasks.
### 3.4 DrugSAGE-Zero: Zero-Test-Time Search by Memory Routing
Normally, DrugSAGE explores candidate skills from the literature, optimizes them through iterative refinement, and returns the best solution found within a given experiment budget. However, this search process often consumes substantial compute and token budgets. With cross-task memory, the agent can draw on the memory accumulated from previous tasks to narrow the search or bypass it entirely by reusing strategies that have proven effective on similar tasks. When $B = 0$, DrugSAGE-Zero transfers a verified solution directly without launching any additional search.
Cross-task memory routing. This is the key to DrugSAGE-Zero's zero-test-time search capability. For a target task $\tau = (\mathcal{D}, \mu, B)$ with dataset description $d_\tau$, the agent forms a task signature

$$\phi(\tau) = \bigl(\mathrm{type}(\tau),\; \mu,\; \log|\mathcal{D}|,\; \mathbf{e}(d_\tau)\bigr), \tag{4}$$

where $\mathbf{e}(d_\tau)$ is an embedding of the task description. Using this signature, the agent identifies *analog tasks* whose task type and evaluation metric are similar to those of the target task. DrugSAGE-Zero ranks these analog tasks by proximity in training set size, $\log|\mathcal{D}_{\text{train}}|$, and by the cosine similarity of the task description embeddings $\mathbf{e}(d_\tau)$. Lastly, the agent retrieves all the solutions developed for the closest analog task and ranks them by their performance. DrugSAGE-Zero returns the best solution for the analog task and deploys it on the target task.
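The routing procedure can be sketched end to end on toy data. Several details below are assumptions, since the text does not fix them: analog tasks are filtered by exact match on task type and metric, the two ranking signals are combined as cosine similarity minus a weighted log-size gap (`size_weight` is a hypothetical parameter), and stored validation scores are assumed to be oriented so that higher is better. All task and solution names are invented.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def route(target, memory, size_weight=0.1):
    """Return the best stored solution of the closest analog task, or None."""
    # Assumption: "similar" task type and metric is approximated by exact match.
    analogs = [m for m in memory
               if m["type"] == target["type"] and m["metric"] == target["metric"]]
    if not analogs:
        return None

    def closeness(m):
        # Assumed combination rule: embedding similarity minus a log-size penalty.
        size_gap = abs(math.log(target["n_train"]) - math.log(m["n_train"]))
        return cosine(target["emb"], m["emb"]) - size_weight * size_gap

    best_task = max(analogs, key=closeness)
    best = max(best_task["solutions"], key=lambda s: s["val_score"])
    return best["name"]

memory = [
    {"type": "classification", "metric": "auroc", "n_train": 4000, "emb": [0.9, 0.1],
     "solutions": [{"name": "gbdt_ecfp", "val_score": 0.86},
                   {"name": "gnn", "val_score": 0.91}]},
    {"type": "classification", "metric": "auroc", "n_train": 500, "emb": [0.0, 1.0],
     "solutions": [{"name": "rf", "val_score": 0.95}]},
    {"type": "regression", "metric": "mae", "n_train": 9000, "emb": [1.0, 0.0],
     "solutions": [{"name": "mlp", "val_score": 0.70}]},
]
target = {"type": "classification", "metric": "auroc", "n_train": 10000, "emb": [1.0, 0.0]}
chosen = route(target, memory)
print(chosen)
```

Note that the second task holds the single highest validation score, yet it is not chosen: routing first commits to the closest analog task and only then ranks that task's solutions.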
## 4 Experiments
Our experiments evaluate DrugSAGE in two settings: (1) we run DrugSAGE in a single-task setting, where agents need to develop a SOTA solution from scratch, without relying on cross-task experience. This setting evaluates DrugSAGE's basic problem-solving and coding ability and allows a proper comparison with previous AI agent frameworks that do not have cross-task memory; (2) we run DrugSAGE in a cross-task setting, allowing the agent to transfer useful experience from previous tasks to reduce the search budget required for a new task, or to directly transfer a solution without test-time search.
### 4.1 Evaluation of DrugSAGE in the single-task setting
Benchmark tasks. To evaluate whether DrugSAGE can build SOTA solutions from scratch, we collected 22 molecular property prediction tasks from Therapeutic Data Commons (TDC) \[[15](https://arxiv.org/html/2605.15461#bib.bib1)\]. We chose TDC because it has a public leaderboard of the best solutions curated by human developers, which lets us verify whether a discovered solution is SOTA. The 22 tasks span a variety of properties critical for drug discovery, including Absorption (6), Distribution (3), Metabolism (6), Excretion (3), and Toxicity (4), with training set sizes ranging from 475 to 13,130.
Baselines. Our baselines span two distinct paradigms, and we construct a fair comparison for each. The first paradigm consists of *optimization-only agents* that refine existing algorithms rather than building a pipeline from scratch: Autoresearch \[[19](https://arxiv.org/html/2605.15461#bib.bib19)\] and ShinkaEvolve \[[20](https://arxiv.org/html/2605.15461#bib.bib17)\], both run by anchoring on the top-three open-sourced TDC leaderboard models. For a fair comparison, we include DrugSAGE-anchor, which searches from the same starting points as the optimization-only agents. The second paradigm consists of agents that develop a full solution pipeline from scratch: two *ML-automation agents*, MLEvolve \[[11](https://arxiv.org/html/2605.15461#bib.bib12)\] and AIRA-Dojo \[[35](https://arxiv.org/html/2605.15461#bib.bib8)\]; one *general-purpose coding agent*, Claude Code \[[1](https://arxiv.org/html/2605.15461#bib.bib50)\]; and three *scientific-discovery agents*, Biomni \[[16](https://arxiv.org/html/2605.15461#bib.bib34)\], STELLA \[[18](https://arxiv.org/html/2605.15461#bib.bib35)\], and Agentomics \[[24](https://arxiv.org/html/2605.15461#bib.bib39)\]. (Footnote 1: Biomni results are provided by the Biomni team from their beta platform and are not independently reproduced by us.) All from-scratch agents, including DrugSAGE, receive the same shared task prompt ([Appendix C](https://arxiv.org/html/2605.15461#A3)). We also include TDC-Leaderboard as a human-curated reference. All agents are powered by Claude-sonnet-4.6, except that Biomni and STELLA use their own default API configurations because their implementations do not support Claude.
Protocol. We evaluated DrugSAGE on all 22 TDC datasets independently, so every task searches for the SOTA solution from scratch. We use the official 5-seed train-validation-test split and collect each task's native metric, so our scores are directly comparable to public TDC leaderboard entries. All agents select their best solutions by validation performance, and we report the average test metrics. We use min-max normalization to compute a normalized score: $\text{score}_{m,d} = (x_{m,d} - \min_d) / (\max_d - \min_d)$ for metrics where higher is better and $\text{score}_{m,d} = (\max_d - x_{m,d}) / (\max_d - \min_d)$ for metrics where lower is better. We report the average normalized score over all 22 tasks in [Table 1](https://arxiv.org/html/2605.15461#S4.T1).
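The normalization above is simple enough to check by hand. The following sketch applies both formulas to made-up numbers; the function name `normalized_scores` is hypothetical, and returning 1.0 for every method when all raw values are equal is an assumption added to avoid division by zero, since the text does not cover that case.

```python
def normalized_scores(values, higher_is_better):
    """Min-max normalize per-dataset raw metrics; values maps method -> raw metric."""
    lo, hi = min(values.values()), max(values.values())
    if hi == lo:
        # Assumption: with identical raw values, every method gets full credit.
        return {method: 1.0 for method in values}
    if higher_is_better:
        return {m: (x - lo) / (hi - lo) for m, x in values.items()}
    return {m: (hi - x) / (hi - lo) for m, x in values.items()}

# Illustrative numbers only, not results from the paper.
auroc = normalized_scores({"a": 0.70, "b": 0.80, "c": 0.90}, higher_is_better=True)
mae = normalized_scores({"a": 0.40, "b": 0.60}, higher_is_better=False)
print(auroc, mae)
```

Because the minimum and maximum are taken over the compared methods on each dataset, a normalized score depends on which methods are in the comparison, not only on the method's own raw metric.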
Results. [Table 1](https://arxiv.org/html/2605.15461#S4.T1) evaluates from-scratch search without cross-task memory. Full DrugSAGE achieves the best average rank (1.95) and the most wins (10/22). To control for the search space, DrugSAGE-anchor disables the Explore Agent and uses the same top-3 TDC leaderboard models as Autoresearch and ShinkaEvolve. Under this matched setting, DrugSAGE-anchor still reaches an average rank of 2.59 and 7/22 wins, outperforming Autoresearch (4.82, 2/22) and ShinkaEvolve (4.95, 0/22). The gap between full DrugSAGE and DrugSAGE-anchor measures the benefit of automatically building the executable skill library.
Table 1: Results on the first scenario, where each method searches for SOTA solutions from scratch. Expl. and Mem. indicate whether the system supports exploration and memory through its standard interface. Avg Rank per Category (columns Abs through Tox, lower is better) is averaged over each subset of datasets. Norm. Score is the average min-max-normalized score over 22 datasets. #Wins/Tot. counts datasets ranked first. Per-task absolute metrics (mean ± std over 5 seeds) are reported in [Appendix D](https://arxiv.org/html/2605.15461#A4).

| Category | Method | Expl. | Mem. | Abs (6) | Dist (3) | Meta (6) | Excr (3) | Tox (4) | Avg Rank ↓ | Norm. Score ↑ | #Wins/Tot. ↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| TDC reference | TDC Leaderboard Official | – | – | 2.83 | 5.33 | 3.17 | 2.33 | 3.00 | 3.23 | 0.878 | 2/22 |
| Optimization-only agents (top-3 leaderboard model anchored) | Autoresearch | ✗ | ✓ | 5.67 | 4.67 | 4.17 | 4.33 | 5.00 | 4.82 | 0.741 | 2/22 |
| Optimization-only agents (top-3 leaderboard model anchored) | ShinkaEvolve | ✗ | ✓ | 5.33 | 4.00 | 5.33 | 5.33 | 4.25 | 4.95 | 0.731 | 0/22 |
| Optimization-only agents (top-3 leaderboard model anchored) | DrugSAGE-anchor | ✗ | ✓ | 2.17 | 3.00 | 2.67 | 3.00 | 2.50 | 2.59 | 0.875 | 7/22 |
| ML automation agents | MLEvolve | ✓ | ✓ | 7.50 | 7.67 | 6.83 | 8.67 | 8.25 | 7.64 | 0.567 | 0/22 |
| ML automation agents | AIRA-dojo | ✓ | ✓ | 7.83 | 8.00 | 8.00 | 7.00 | 6.75 | 7.59 | 0.615 | 0/22 |
| General coding agents | Claude Code | ✓ | ✓ | 7.17 | 5.67 | 6.67 | 5.00 | 7.75 | 6.64 | 0.646 | 0/22 |
| Scientific discovery agents | Biomni | ✓ | ✗ | 10.50 | 10.67 | 9.17 | 11.00 | 10.75 | 10.27 | 0.076 | 0/22 |
| Scientific discovery agents | STELLA | ✓ | ✓ | 7.33 | 8.00 | 8.67 | 7.67 | 7.00 | 7.77 | 0.550 | 1/22 |
| Scientific discovery agents | Agentomics | ✓ | ✓ | 8.17 | 8.00 | 8.67 | 8.67 | 8.75 | 8.45 | 0.544 | 0/22 |
| Ours | DrugSAGE | ✓ | ✓ | 1.33 | 1.00 | 2.67 | 3.00 | 1.75 | 1.95 | 0.929 | 10/22 |
### 4.2 Evaluation of DrugSAGE in the cross-task setting
Experience pool and held-out benchmark tasks. In this scenario, we test whether the memory $\mathcal{Z}$ can transfer experience to a new target task and eliminate the need for test-time search. For this purpose, we partition the 22 TDC datasets by training-set size: the 16 smallest tasks ($<5{,}000$ samples) form the *experience pool*, from which DrugSAGE builds the cross-task memory $\mathcal{Z}$ and the skill library $\mathcal{K}$; the remaining six largest tasks serve as held-out tasks. This size-based split allows DrugSAGE-zero to gain experience on small tasks before the agent encounters the larger targets, where each trial is more expensive. In addition, we collect 11 tasks from the Polaris Hub \[[2](https://arxiv.org/html/2605.15461#bib.bib40)\], including six ADMET tasks from Fang et al. \[[9](https://arxiv.org/html/2605.15461#bib.bib41)\] and five kinase-inhibition tasks based on the PKIS2 data of Drewry et al. \[[7](https://arxiv.org/html/2605.15461#bib.bib42)\]. These 11 Polaris tasks form a separate held-out benchmark set that is used only at evaluation time and never enters $\mathcal{Z}$. The two held-out sets together give 17 tasks on which we measure cross-task memory transfer.
Setup. Following the zero-test-time routing regime defined in §[3.4](https://arxiv.org/html/2605.15461#S3.SS4), DrugSAGE-zero transfers a verified solution from memory with $B=0$, that is, without test-time search on the held-out tasks. Baseline agents also have their own memory mechanisms, but they do not maintain the cross-task memory studied here. We therefore report their best-so-far performance over $B \in \{1,\ldots,20\}$ search steps under each agent's native budget definition. All methods are evaluated on the same held-out targets using the benchmark metrics.
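The "best-so-far performance over $B$ search steps" reported for the baselines is a running optimum over each agent's per-step scores. A minimal sketch, assuming each search step yields one score on the target task; the function name and the example numbers are hypothetical:

```python
from typing import List

def best_so_far(scores: List[float], higher_is_better: bool = True) -> List[float]:
    """Return the running best score after each search step B = 1, 2, ..."""
    curve: List[float] = []
    best = None
    for s in scores:
        improved = best is None or (s > best if higher_is_better else s < best)
        if improved:
            best = s
        curve.append(best)
    return curve

# Made-up per-step scores for an AUROC-like metric (higher is better).
auroc_curve = best_so_far([0.70, 0.68, 0.74, 0.73])
# Made-up per-step scores for an MAE-like metric (lower is better).
mae_curve = best_so_far([0.60, 0.55, 0.58], higher_is_better=False)
```

By construction the curve is monotone, so a baseline's value at $B=20$ is the best score it reached anywhere in its budget.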
Figure 3: Cross-task efficiency on held-out tasks. (a) Six TDC-ADMET held-out tasks. (b) Eleven Polaris held-out tasks. Scores are min-max normalized per task before averaging; higher is better. The yellow dashed line is DrugSAGE-zero with $B=0$, and the green curve is DrugSAGE with the same memory and up to 20 target-task search steps. TDC-ADMET and Polaris absolute metrics and per-task curves are in [Appendix E](https://arxiv.org/html/2605.15461#A5) and [Appendix F](https://arxiv.org/html/2605.15461#A6).

Results. [Figure 3](https://arxiv.org/html/2605.15461#S4.F3) compares DrugSAGE-zero against baseline agents with increasing search budgets. On the six held-out TDC-ADMET tasks, DrugSAGE-zero reaches an average normalized performance of 0.90, exceeding the strongest baseline at $B=20$. On the 11 held-out Polaris tasks, it reaches 0.97, again outperforming all baselines using their full search budget. On both benchmarks, DrugSAGE-zero outperforms the baselines by 10-30%.
The Polaris results test whether zero-test-time routing works beyond the TDC benchmark, which is partly used to build memory. [Table 2](https://arxiv.org/html/2605.15461#S4.T2) shows the performance (without normalization) for each TDC held-out task. We find that DrugSAGE-zero is competitive with the top TDC leaderboard models and with DrugSAGE using $B=20$ test-time search. In some cases, such as Solubility and CYP2D6 Inhibition, DrugSAGE-zero even outperforms DrugSAGE (0.686 vs. 0.722 MAE; 0.779 vs. 0.724 AUPRC). LD50 is the only target where DrugSAGE-zero underperforms by a non-trivial margin (0.547 vs. 0.502 MAE), but it stays close to the best leaderboard model. In summary, these results show that experience accumulated on the historical tasks transfers to held-out tasks and can provide a high-quality solution for related problems without test-time search. The code of the transferred solutions and the accompanying analysis are given in [Section G.3](https://arxiv.org/html/2605.15461#A7.SS3).
Table 2: The cross-task performance (without normalization) of DrugSAGE-zero on six TDC held-out tasks. Full cross-task results for all individual tasks are in [Table 8](https://arxiv.org/html/2605.15461#A7.T8) and [Table 9](https://arxiv.org/html/2605.15461#A7.T9).
### 4.3 Ablation Study
Benefit of the explore agent. In [Table 1](https://arxiv.org/html/2605.15461#S4.T1), we compare DrugSAGE against DrugSAGE-anchor, in which the explore agent is disabled and replaced with the top TDC leaderboard models. DrugSAGE outperforms DrugSAGE-anchor (normalized score 0.929 vs. 0.875), indicating the benefit of the skill library automatically constructed by the explore agent.
LLM cost. [Figure 4](https://arxiv.org/html/2605.15461#S4.F4)(a) compares the total LLM API cost on the six held-out TDC-ADMET tasks at $B=20$. Compared with general coding and optimization baselines, DrugSAGE variants incur substantially lower LLM cost. In particular, DrugSAGE-zero performs no LLM generation at test time and only calls the lightweight `text-embedding-3-small` model for memory routing, resulting in near-zero task-level LLM API cost.
Importance of cross-task memory. [Figure 4](https://arxiv.org/html/2605.15461#S4.F4)(a, b) evaluates the contribution of the cross-task experience memory $\mathcal{Z}$ under the same experimental setting. From DrugSAGE without $\mathcal{Z}$, to DrugSAGE with $\mathcal{Q}$, to the full memory, the normalized score increases monotonically while the cost decreases monotonically, showing that the gain comes not only from the base search procedure but also from reusing experience across tasks.
Impact of analog tasks. To examine the impact of analog tasks on the performance of DrugSAGE-zero, we conduct ablation studies on three target tasks (CYP2C9, CYP3A4, CYP2D6) for which the experience pool contains tasks of the same category. We run DrugSAGE-zero with these same-category tasks removed from the experience pool. This forces the agent to select a less related source task from a different task category, often with a different metric. As shown in [Figure 4](https://arxiv.org/html/2605.15461#S4.F4)(c), DrugSAGE-zero remains competitive with the TDC leaderboard best on all three targets: 0.822 versus 0.820 on CYP2C9 inhibition, 0.892 versus 0.898 on CYP3A4 inhibition, and 0.733 versus 0.728 on CYP2D6 inhibition. Removing same-category tasks reduces performance by only 0.03–0.05, and the routed solutions remain close to the leaderboard level at $B=0$. These results indicate that DrugSAGE-zero is not merely copying same-category analogs on these CYP targets.
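The text above states only that memory routing relies on an embedding model and that removing same-category tasks forces the router to a less related source task. One plausible reading is nearest-neighbor retrieval over task embeddings, sketched below. This is an assumption, not the paper's router: the cosine-similarity criterion, the function names, the `exclude` parameter, and the two-dimensional toy vectors are all hypothetical, and no embedding API is called.

```python
import math
from typing import Dict, FrozenSet, List, Tuple

def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return dot / (nu * nv)

def route(query: List[float],
          memory: Dict[str, List[float]],
          exclude: FrozenSet[str] = frozenset()) -> Tuple[str, float]:
    """Pick the stored source task whose embedding is most similar to the query.

    `exclude` mimics an ablation that removes some source tasks from the pool.
    """
    candidates = {name: vec for name, vec in memory.items() if name not in exclude}
    if not candidates:
        raise ValueError("no source task left to route to")
    best = max(candidates, key=lambda name: cosine(query, candidates[name]))
    return best, cosine(query, candidates[best])

# Toy, made-up embeddings for three hypothetical source tasks.
memory = {
    "source_same_category": [1.0, 0.0],
    "source_related": [0.7, 0.7],
    "source_unrelated": [0.0, 1.0],
}
target_embedding = [0.9, 0.1]

full_match, _ = route(target_embedding, memory)
ablated_match, _ = route(target_embedding, memory,
                         exclude=frozenset({"source_same_category"}))
```

In this toy setup the ablated router falls back to the next most similar source task, which mirrors the qualitative behavior described for the CYP ablation.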
Figure 4: (a) Total LLM API cost on the six held-out TDC-ADMET tasks. Compared with baselines, DrugSAGE variants require substantially lower LLM cost. DrugSAGE-zero uses nearly zero LLM cost by reusing cross-task experience without test-time search. (b) Performance improves from DrugSAGE without $\mathcal{Z}$, to DrugSAGE with $\mathcal{Q}$, and further to full DrugSAGE, showing each module's contribution. (c) Ablation study of analog-task impact on the three target tasks.
## 5 Discussion
In this paper, we introduce DrugSAGE, an agentic framework for efficient drug discovery powered by agent-driven exploration, cross-task memory, and automatic experiment refinement. We demonstrate that DrugSAGE can leverage knowledge learned from different tasks to improve performance over baseline agents on both in-distribution and out-of-distribution problems. Several directions could further improve the framework. For example, DrugSAGE currently prioritizes skill sets based on prior knowledge such as citation number and GitHub number. Selecting important skills by combining this prior knowledge with experiments on datasets could be helpful. In addition, we test the framework mainly on drug discovery tasks, but it could also be generalized to other scientific research areas. We plan to work on these directions in the future.
## 6 Acknowledgments
We gratefully acknowledge support from the Google Research Scholar Award, the NVIDIA Academic Grant Program, the Google TPU Research Cloud Award, and NSF ACCESS\.
## References
- \[1\] Anthropic (2025) Claude code documentation. Note: [https://code.claude.com/docs/en/overview](https://code.claude.com/docs/en/overview). Accessed: 2026-05-06.
- \[2\] J. R. Ash, C. Wognum, R. Rodríguez-Pérez, M. Aldeghi, A. C. Cheng, D. Clevert, O. Engkvist, C. Fang, D. J. Price, J. M. Hughes-Oliver, et al. (2025) Practically significant method comparison protocols for machine learning in small molecule drug discovery. Journal of Chemical Information and Modeling 65 (18), pp. 9398–9411.
- \[3\] G. Bai, Z. Chai, C. Ling, S. Wang, J. Lu, N. Zhang, T. Shi, Z. Yu, M. Zhu, Y. Zhang, et al. (2024) Beyond efficiency: a systematic survey of resource-efficient large language models. arXiv preprint arXiv:2401.00625.
- \[4\] D. A. Boiko, R. MacKnight, B. Kline, and G. Gomes (2023) Autonomous chemical research with large language models. Nature 624 (7992), pp. 570–578.
- \[5\] J. Chen, B. D. Mishra, J. Nam, R. Meng, T. Pfister, and J. Yoon (2026) MARS: modular agent with reflective search for automated ai research. arXiv preprint arXiv:2602.02660.
- \[6\] M. Chen, Y. Li, Y. Yang, S. Yu, B. Lin, and X. He (2024) Automanual: constructing instruction manuals by llm agents via interactive environmental learning. Advances in Neural Information Processing Systems 37, pp. 589–631.
- \[7\] D. H. Drewry, C. I. Wells, D. M. Andrews, R. Angell, H. Al-Ali, A. D. Axtman, S. J. Capuzzi, J. M. Elkins, P. Ettmayer, M. Frederiksen, et al. (2017) Progress towards a public chemogenomic set for protein kinases and a call for contributions. PloS one 12 (8), pp. e0181585.
- \[8\] Y. Du, B. Yu, T. Liu, T. Shen, J. Chen, J. G. Rittig, K. Sun, Y. Zhang, Z. Song, B. Zhou, et al. (2025) Accelerating scientific discovery with autonomous goal-evolving agents. arXiv preprint arXiv:2512.21782.
- \[9\] C. Fang, Y. Wang, R. Grater, S. Kapadnis, C. Black, P. Trapa, and S. Sciabola (2023) Prospective validation of machine learning algorithms for absorption, distribution, metabolism, and excretion prediction: an industrial perspective. Journal of Chemical Information and Modeling 63 (11), pp. 3263–3274.
- \[10\] R. Fang, Y. Liang, X. Wang, J. Wu, S. Qiao, P. Xie, F. Huang, H. Chen, and N. Zhang (2025) Memp: exploring agent procedural memory. arXiv preprint arXiv:2508.06433.
- \[11\] S. Feng, R. Ma, X. Yan, Y. Fan, Y. Hu, S. Huang, S. Zhang, Z. Cao, T. Peng, J. Yuan, Z. Guo, Z. Zhong, S. Du, W. Wang, J. Shi, Y. Zhou, X. He, Z. Yu, F. Yu, B. Zhan, Q. Zheng, J. Wu, M. Liu, C. Zhang, S. Hou, S. Li, Y. Jiang, W. Lou, L. Wang, Z. Wang, J. Wang, W. Xu, Y. Deng, D. Liu, Y. Wang, W. Zhang, F. Ling, S. Zhang, X. Wang, S. Zheng, X. Huang, S. Sun, S. Hu, P. Ye, C. Song, B. Wang, C. He, Y. Liu, X. Li, Q. Hou, T. Chen, X. Yue, B. Wang, L. He, D. Lin, B. Zhou, B. Zhang, and L. Bai (2026) InternAgent-1.5: a unified agentic framework for long-horizon autonomous scientific discovery. arXiv preprint arXiv:2602.08990.
- \[12\] W. Green, J. Burns, A. S. Zalte, C. Abreu, J. Sieg, C. Feldmann, and M. Mathea (2026) Deep learning foundation models from classical molecular descriptors.
- \[13\] S. Guo, C. Deng, Y. Wen, H. Chen, Y. Chang, and J. Wang (2024) DS-agent: automated data science by empowering large language models with case-based reasoning. In Forty-first International Conference on Machine Learning.
- \[14\] W. Hoeffding (1963) Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association 58 (301), pp. 13–30.
- \[15\] K. Huang, T. Fu, W. Gao, Y. Zhao, Y. Roohani, J. Leskovec, C. W. Coley, C. Xiao, J. Sun, and M. Zitnik (2021) Therapeutics data commons: machine learning datasets and tasks for drug discovery and development. Advances in Neural Information Processing Systems.
- \[16\] K. Huang, S. Zhang, H. Wang, Y. Qu, Y. Lu, Y. Roohani, R. Li, L. Qiu, G. Li, J. Zhang, et al. (2025) Biomni: a general-purpose biomedical ai agent. bioRxiv.
- \[17\] Z. Jiang, D. Schmidt, D. Srikanth, D. Xu, I. Kaplan, D. Jacenko, and Y. Wu (2025) Aide: ai-driven exploration in the space of code. arXiv preprint arXiv:2502.13138.
- \[18\] R. Jin, Z. Zhang, M. Wang, and L. Cong (2025) Stella: self-evolving llm agent for biomedical research. arXiv preprint arXiv:2507.02004.
- \[19\] A. Karpathy (2026) Autoresearch: ai agents running research on single-gpu nanochat training automatically. GitHub. Note: [https://github.com/karpathy/autoresearch](https://github.com/karpathy/autoresearch).
- \[20\] R. T. Lange, Y. Imajuku, and E. Cetin (2025) Shinkaevolve: towards open-ended and sample-efficient program evolution. arXiv preprint arXiv:2509.19349.
- \[21\] A. Li, C. Wu, Z. Ge, Y. H. Chong, Z. Hou, L. Cao, C. Ju, J. Wu, H. Li, H. Zhang, et al. (2025) The fm agent. arXiv preprint arXiv:2510.26144.
- \[22\] R. Liaw, E. Liang, R. Nishihara, P. Moritz, J. E. Gonzalez, and I. Stoica (2018) Tune: a research platform for distributed model selection and training. arXiv preprint arXiv:1807.05118.
- \[23\] A. M. Bran, S. Cox, O. Schilter, C. Baldassari, A. D. White, and P. Schwaller (2024) Augmenting large language models with chemistry tools. Nature Machine Intelligence 6 (5), pp. 525–535.
- \[24\] V. Martinek, A. Gariboldi, D. Tzimotoudis, M. Galea, E. Zacharopoulou, A. A. Escudero, E. Blake, D. Čechák, L. Cassar, A. Balestrucci, et al. (2026) Agentomics: an agentic system that autonomously develops novel state-of-the-art solutions for biomedical machine learning tasks. bioRxiv, pp. 2026–01.
- \[25\] A. Nadafian, A. Mohammadshahi, and M. Yazdani (2026) KAPSO: a knowledge-grounded framework for autonomous program synthesis and optimization. arXiv preprint arXiv:2601.21526.
- \[26\] J. Nam, J. Yoon, J. Chen, J. Shin, S. O. Arik, and T. Pfister (2025) MLE-star: machine learning engineering agent via search and targeted refinement. In The Thirty-ninth Annual Conference on Neural Information Processing Systems.
- \[27\] A. Novikov, N. Vũ, M. Eisenberger, E. Dupont, P. Huang, A. Z. Wagner, S. Shirobokov, B. Kozlovskii, F. J. Ruiz, A. Mehrabian, et al. (2025) Alphaevolve: a coding agent for scientific and algorithmic discovery. arXiv preprint arXiv:2506.13131.
- \[28\] Polaris (2025) Polaris hub. GitHub. Note: [https://github.com/polaris-hub/polaris](https://github.com/polaris-hub/polaris).
- \[29\] A. Qu, H. Zheng, Z. Zhou, Y. Yan, Y. Tang, S. Y. Ong, F. Hong, K. Zhou, C. Jiang, M. Kong, et al. (2026) CORAL: towards autonomous multi-agent evolution for open-ended discovery. arXiv preprint arXiv:2604.01658.
- \[30\] B. Romera-Paredes, M. Barekatain, A. Novikov, M. Balog, M. P. Kumar, E. Dupont, F. J. Ruiz, J. S. Ellenberg, P. Wang, O. Fawzi, et al. (2024) Mathematical discoveries from program search with large language models. Nature 625 (7995), pp. 468–475.
- \[31\] N. Shinn, F. Cassano, A. Gopinath, K. Narasimhan, and S. Yao (2023) Reflexion: language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems 36, pp. 8634–8652.
- \[32\] P. Shojaee, K. Meidani, S. Gupta, A. B. Farimani, and C. K. Reddy (2025) LLM-sr: scientific equation discovery via programming with large language models. In The Thirteenth International Conference on Learning Representations.
- \[33\] M. Suzgun, M. Yuksekgonul, F. Bianchi, D. Jurafsky, and J. Zou (2026) Dynamic cheatsheet: test-time learning with adaptive memory. In Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 7080–7106.
- \[34\] X. Tang, T. Qin, T. Peng, Z. Zhou, D. Shao, T. Du, X. Wei, P. Xia, F. Wu, H. Zhu, et al. (2025) Agent kb: leveraging cross-domain experience for agentic problem solving. arXiv preprint arXiv:2507.06229.
- \[35\] E. Toledo, K. Hambardzumyan, M. Josifoski, R. Hazra, N. Baldwin, A. Audran-Reiss, M. Kuchnik, D. Magka, M. Jiang, A. M. Lupidi, et al. (2025) AI research agents for machine learning: search, exploration, and generalization in mle-bench. In The Thirty-ninth Annual Conference on Neural Information Processing Systems.
- \[36\] G. Wang, Y. Xie, Y. Jiang, A. Mandlekar, C. Xiao, Y. Zhu, L. Fan, and A. Anandkumar (2024) Voyager: an open-ended embodied agent with large language models. Transactions on Machine Learning Research. ISSN 2835-8856. [Link](https://openreview.net/forum?id=ehfRiF0R3a).
- \[37\] H. Wang, M. Skreta, C. T. Ser, W. Gao, L. Kong, F. Strieth-Kalthoff, C. Duan, Y. Zhuang, Y. Yu, Y. Zhu, et al. (2025) Efficient evolutionary search over chemical space with large language models. In The Thirteenth International Conference on Learning Representations.
- \[38\] Z. Z. Wang, J. Mao, D. Fried, and G. Neubig (2025) Agent workflow memory. In Forty-second International Conference on Machine Learning. [Link](https://openreview.net/forum?id=NTAhi2JEEE).
- \[39\] J. Wei, Y. Yang, X. Zhang, Y. Chen, X. Zhuang, Z. Gao, D. Zhou, G. Wang, Z. Gao, J. Cao, et al. (2025) From ai for science to agentic science: a survey on autonomous scientific discovery. arXiv preprint arXiv:2508.14111.
- \[40\] Y. Xiao, Y. Li, H. Wang, Y. Tang, and Z. Z. Wang (2025) Toolmem: enhancing multimodal agents with learnable tool capability memory. arXiv preprint arXiv:2510.06664.
- \[41\] X. Yang, X. Yang, S. Fang, B. Xian, Y. Li, J. Wang, M. Xu, H. Pan, X. Hong, W. Liu, et al. (2025) R&d-agent: automating data-driven ai solution building through llm-powered automated research, development, and evolution. arXiv e-prints, pp. arXiv–2505.
- \[42\] A. Zhao, D. Huang, Q. Xu, M. Lin, Y. Liu, and G. Huang (2024) Expel: llm agents are experiential learners. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 38, pp. 19632–19642.
- \[43\] B. Zheng, M. Y. Fatemi, X. Jin, Z. Z. Wang, A. Gandhi, Y. Song, Y. Gu, J. Srinivasa, G. Liu, G. Neubig, et al. (2025) Skillweaver: web agents can self-improve by discovering and honing skills. arXiv preprint arXiv:2504.07079.
## Appendix A Benchmark Dataset Summary
Table 3 provides a complete description of all benchmark datasets used in our evaluation, covering 22 ADMET datasets from the Therapeutics Data Commons (TDC) \[[15](https://arxiv.org/html/2605.15461#bib.bib1)\] and 11 held-out test datasets from the Polaris Hub \[[2](https://arxiv.org/html/2605.15461#bib.bib40)\].
Table 3: All benchmark datasets. Size is the total number of items in the training and testing sets.

| Dataset | Description | Size | Metric |
|---|---|---|---|
| tdcommons-caco2 | Predict intestinal permeability in Caco-2 cell assay. | 906 | MAE |
| tdcommons-hia | Predict human intestinal absorption. | 578 | AUROC |
| tdcommons-pgp | Classify P-glycoprotein (Pgp) inhibition for absorption risk assessment. | 1,212 | AUROC |
| tdcommons-bioavailability | Predict oral bioavailability. | 640 | AUROC |
| tdcommons-lipophilicity | Predict lipophilicity (logP). | 4,200 | MAE |
| tdcommons-solubility | Predict aqueous solubility. | 9,982 | MAE |
| tdcommons-bbb-martins | Classify blood–brain barrier penetration. | 1,975 | AUROC |
| tdcommons-ppbr | Predict human plasma protein binding rate. | 1,797 | MAE |
| tdcommons-vdss | Predict volume of distribution at steady state (VDss). | 1,130 | Spearman |
| tdcommons-cyp2c9-substrate | Classify CYP2C9 enzyme substrates for drug metabolism. | 666 | AUPRC |
| tdcommons-cyp2d6-substrate | Classify CYP2D6 enzyme substrates for drug–drug interaction risk. | 664 | AUPRC |
| tdcommons-cyp3a4-substrate | Classify CYP3A4 enzyme substrates for drug metabolism. | 667 | AUROC |
| tdcommons-cyp2c9-inhibition | Classify CYP2C9 enzyme inhibitors for drug–drug interaction risk. | 12,092 | AUPRC |
| tdcommons-cyp2d6-inhibition | Classify CYP2D6 enzyme inhibitors for drug–drug interaction risk. | 13,130 | AUPRC |
| tdcommons-cyp3a4-inhibition | Classify CYP3A4 enzyme inhibitors for drug–drug interaction risk. | 12,328 | AUPRC |
| tdcommons-half-life | Predict drug half-life duration. | 667 | Spearman |
| tdcommons-clearance-hepatocyte | Predict hepatocyte intrinsic drug clearance. | 1,020 | Spearman |
| tdcommons-clearance-microsome | Predict microsomal intrinsic drug clearance. | 1,102 | Spearman |
| tdcommons-herg | Classify hERG potassium channel blockers to assess cardiotoxicity risk. | 648 | AUROC |
| tdcommons-dili | Classify drug-induced liver injury (DILI) risk. | 475 | AUROC |
| tdcommons-ames | Classify mutagenicity via the Ames bacterial reverse mutation assay. | 7,255 | AUROC |
| tdcommons-ld50 | Predict acute oral toxicity (LD50). | 7,385 | MAE |
| polaris-adme-fang-hclint | Predict human liver microsomal intrinsic clearance. | 2,806 | Pearson |
| polaris-adme-fang-rclint | Predict rat liver microsomal intrinsic clearance. | 2,779 | Pearson |
| polaris-adme-fang-perm | Predict MDR1-MDCK efflux ratio (permeability). | 2,403 | Pearson |
| polaris-adme-fang-solu | Predict compound solubility using standardised ADME protocols. | 1,978 | Pearson |
| polaris-adme-fang-hppb | Predict human plasma protein binding. | 160 | Pearson |
| polaris-adme-fang-rppb | Predict rat plasma protein binding. | 135 | Pearson |
| polaris-pkis2-egfr-wt-reg | Predict EGFR wild-type kinase inhibition (% inhibition). | 640 | MSE |
| polaris-pkis2-ret-wt-cls | Classify RET wild-type kinase inhibitors for cancer target engagement. | 640 | AUPRC |
| polaris-pkis2-ret-wt-reg | Predict RET wild-type kinase inhibition (% inhibition). | 640 | MSE |
| polaris-pkis2-kit-wt-cls | Classify KIT wild-type kinase inhibitors for cancer target engagement. | 640 | AUPRC |
| polaris-pkis2-kit-wt-reg | Predict KIT wild-type kinase inhibition (% inhibition). | 640 | MSE |
## Appendix B Dataset Partition
Table 4 lists the experience pool and target set used in the cross-task amortization experiments (Section [4.2](https://arxiv.org/html/2605.15461#S4.SS2)). The TDC split is determined by training-set size: the 6 largest TDC datasets ($\geq 5{,}000$ training samples; the next-smallest dataset is Lipophilicity at $4{,}200$) constitute the TDC part of the target set, which additionally contains the 11 Polaris datasets; the remaining 16 TDC datasets form the experience pool.
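Table 4: Partition of the 33 benchmark datasets into experience pool (16) and target set (17). The target set contains the 6 largest TDC ADMET datasets and 11 Polaris datasets.

| Role | Dataset | Size | Task type | Metric |
|---|---|---|---|---|
| Experience pool | DILI | 475 | Binary cls. | AUROC |
| Experience pool | HIA | 578 | Binary cls. | AUROC |
| Experience pool | Bioavailability | 640 | Binary cls. | AUROC |
| Experience pool | hERG | 648 | Binary cls. | AUROC |
| Experience pool | CYP2D6 Substrate | 664 | Binary cls. | AUPRC |
| Experience pool | CYP2C9 Substrate | 666 | Binary cls. | AUPRC |
| Experience pool | CYP3A4 Substrate | 667 | Binary cls. | AUROC |
| Experience pool | Half Life | 667 | Regression | Spearman |
| Experience pool | Caco-2 | 906 | Regression | MAE |
| Experience pool | Clearance (hepatocyte) | 1,020 | Regression | Spearman |
| Experience pool | Clearance (microsome) | 1,102 | Regression | Spearman |
| Experience pool | VDss | 1,130 | Regression | Spearman |
| Experience pool | Pgp | 1,212 | Binary cls. | AUROC |
| Experience pool | PPBR | 1,797 | Regression | MAE |
| Experience pool | BBB | 1,975 | Binary cls. | AUROC |
| Experience pool | Lipophilicity | 4,200 | Regression | MAE |
| Target set | Ames | 7,255 | Binary cls. | AUROC |
| Target set | LD50 | 7,385 | Regression | MAE |
| Target set | Solubility (AqSolDB) | 9,982 | Regression | MAE |
| Target set | CYP2C9 Inhibition | 12,092 | Binary cls. | AUPRC |
| Target set | CYP3A4 Inhibition | 12,328 | Binary cls. | AUPRC |
| Target set | CYP2D6 Inhibition | 13,130 | Binary cls. | AUPRC |
| Target set | Polaris HClint | 2,806 | Regression | Pearson |
| Target set | Polaris RClint | 2,779 | Regression | Pearson |
| Target set | Polaris Perm | 2,403 | Regression | Pearson |
| Target set | Polaris Solu | 1,978 | Regression | Pearson |
| Target set | Polaris HPPB | 160 | Regression | Pearson |
| Target set | Polaris RPPB | 135 | Regression | Pearson |
| Target set | Polaris EGFR WT Reg | 640 | Regression | MSE |
| Target set | Polaris RET WT Cls | 640 | Binary cls. | AUPRC |
| Target set | Polaris RET WT Reg | 640 | Regression | MSE |
| Target set | Polaris KIT WT Cls | 640 | Binary cls. | AUPRC |
| Target set | Polaris KIT WT Reg | 640 | Regression | MSE |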
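The size-based TDC split can be reproduced with a simple threshold, sketched below under stated assumptions. The function name is hypothetical, and the sizes are the values listed in Table 4, used here as stand-ins for the training-set sizes that the split rule refers to; the 5,000 threshold separates the two groups either way, since no TDC dataset falls between 4,200 and 7,255.

```python
from typing import Dict, List, Tuple

# Sizes of the 22 TDC datasets as listed in Table 4 (stand-ins for training-set sizes).
TDC_SIZES: Dict[str, int] = {
    "DILI": 475, "HIA": 578, "Bioavailability": 640, "hERG": 648,
    "CYP2D6 Substrate": 664, "CYP2C9 Substrate": 666, "CYP3A4 Substrate": 667,
    "Half Life": 667, "Caco-2": 906, "Clearance (hepatocyte)": 1020,
    "Clearance (microsome)": 1102, "VDss": 1130, "Pgp": 1212, "PPBR": 1797,
    "BBB": 1975, "Lipophilicity": 4200, "Ames": 7255, "LD50": 7385,
    "Solubility (AqSolDB)": 9982, "CYP2C9 Inhibition": 12092,
    "CYP3A4 Inhibition": 12328, "CYP2D6 Inhibition": 13130,
}

def partition_by_size(sizes: Dict[str, int],
                      threshold: int = 5000) -> Tuple[List[str], List[str]]:
    """Split datasets into (experience pool, held-out targets), each sorted by size."""
    pool = sorted((name for name, n in sizes.items() if n < threshold), key=sizes.get)
    targets = sorted((name for name, n in sizes.items() if n >= threshold), key=sizes.get)
    return pool, targets

experience_pool, tdc_targets = partition_by_size(TDC_SIZES)
```

The 11 Polaris datasets are not passed through this function: they are held out by construction and never enter the memory, regardless of their size.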
## Appendix C Per-Baseline Protocols and Prompts
This section documents the run protocol, prompt structure, and non-default hyperparameters for every baseline evaluated in Section [4](https://arxiv.org/html/2605.15461#S4). Except for Biomni, all models use the 5-seed split of train and validation sets for training and the fixed `prepare.py` evaluation pipeline.
##### Computational setup\.
All agents, including DrugSAGE and all baselines, use Claude Sonnet 4.6 as the backbone LLM via the Anthropic API unless the baseline's original paper specifies a different model (see per-baseline notes below). Each experiment is run on a single NVIDIA L40S GPU, and the wall-clock budget per agent per task is capped at 24 hours.
##### Shared task prompt \(from\-scratch baselines\)\.
All baselines except Autoresearch and ShinkaEvolve receive the same per-dataset task description. This description specifies (i) the task name and task type (classification or regression); (ii) the primary metric and its optimization direction; (iii) the train/validation pool size and held-out test size; (iv) the `prepare.py` API contract (`load_data`, `load_seed_split`, `evaluate`, `save_predictions`, `SEEDS`); (v) the required output format `[result] METRIC = MEAN +/- STD`; and (vi) constraints forbidding modification of `prepare.py`, test-label leakage, or package installation outside the pre-built conda environment. For Polaris tasks the description is identical in structure, substituting `prepare_polaris.py` and the Polaris-specific metric. The full template is available in the released codebase.
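The required output line `[result] METRIC = MEAN +/- STD` can be produced and parsed as in the sketch below. Only the line format comes from the task description; the helper names, the four-decimal formatting, and the use of the sample standard deviation are assumptions, and the sketch does not call or reproduce anything from `prepare.py`.

```python
import re
from statistics import mean, stdev
from typing import List, Tuple

_NUM = r"[-+]?\d*\.?\d+(?:[eE][-+]?\d+)?"
RESULT_RE = re.compile(
    rf"^\[result\]\s+(?P<metric>\S+)\s*=\s*(?P<mean>{_NUM})\s*\+/-\s*(?P<std>{_NUM})$"
)

def format_result(metric: str, seed_scores: List[float]) -> str:
    """Format per-seed scores as '[result] METRIC = MEAN +/- STD'.

    Assumption: STD is the sample standard deviation, printed with 4 decimals.
    """
    return f"[result] {metric} = {mean(seed_scores):.4f} +/- {stdev(seed_scores):.4f}"

def parse_result(line: str) -> Tuple[str, float, float]:
    """Parse a result line back into (metric, mean, std); raise on malformed input."""
    match = RESULT_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a result line: {line!r}")
    return match["metric"], float(match["mean"]), float(match["std"])

# Made-up scores for five seeds of a hypothetical MAE task.
line = format_result("MAE", [0.50, 0.52, 0.48, 0.51, 0.49])
parsed = parse_result(line)
```

A strict parser of this kind lets a harness reject runs whose final line deviates from the contract instead of silently recording a wrong number.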
### C.1 Autoresearch
Autoresearch \[[19](https://arxiv.org/html/2605.15461#bib.bib19)\] is an automatic ML pipeline optimization agent. For each TDC dataset it starts from the top-3 open-source leaderboard models, and for each Polaris dataset it starts from Chemelon \[[12](https://arxiv.org/html/2605.15461#bib.bib44)\].
Prompt structure. Each anchor model directory contains an auto-generated `CLAUDE.md` system prompt (one per dataset/model pair); placeholders below are filled per task. The pipeline automatically records outcomes in `results.tsv`.
`Autoresearch / Claude Code system prompt (CLAUDE.md, abbreviated)`
`C.2 ShinkaEvolve
ShinkaEvolve [20] is an evolutionary algorithm optimization agent. It maintains a population of candidate programs, samples parents from the archive, and mutates them via LLM-generated code blocks. The starting point of ShinkaEvolve is the same as Autoresearch.
Prompt structure. The task-level system message is set via task_sys_msg in shinka_config.yaml:
ShinkaEvolve task system message (shinka_config.yaml)
Mutation prompts are constructed by a PromptSampler: for diff patches (70% probability), the prompt appends SEARCH/REPLACE format instructions requesting a unified diff; for full patches (30%), it requests a complete rewrite. Both include the parent program’s code, its performance metrics, and the code and metrics of archive inspiration programs sorted in ascending-score order.
C.3 MLEvolve
MLEvolve [11] is an MLE agent with tree-search that explores a solution tree via draft–debug–improve cycles. It generates multiple initial drafts, executes them, and iteratively expands the tree by selecting promising nodes, proposing code improvements, and executing them.
Prompt structure. Each dataset has a description.md providing the task prompt. Internally, MLEvolve has specialized sub-agents for drafting, debugging, improving, code review, data-leakage checking, result parsing, and multi-branch fusion, each with its own prompt template.
MLEvolve task description (description.md, abbreviated)
C.4 AIRA-Dojo
AIRA-Dojo [35] is an ML research agent that iterates through draft, improve, debug, and analyze operators via MCTS, each backed by a prompted LLM call that generates a self-contained Python script.
Prompt structure. Each operator has a Jinja2-templated system prompt defined in YAML; the draft operator is shown below. The improve prompt is similar but also injects the previous solution’s code and execution output. The debug prompt focuses on fixing a buggy script.
A shared instructions.txt prepends the benchmark contract (use prepare.py, 5 seeds, output format) to all operator prompts. Draft complexity is varied across rounds (simple, normal, complex).
AIRA-Dojo draft operator (Jinja2 template, abbreviated)
C.5 Claude Code
Claude Code [1] is Anthropic’s agentic coding assistant, used as a general-purpose baseline. The CLAUDE.md system prompt is identical to that of Autoresearch (see the prompt box in §C.1), except for the setting of anchor methods.
C.6 STELLA
STELLA [18] is a self-evolving LLM agent for biomedical research. We run STELLA in its default self-evolve mode with the shared task prompt.
STELLA prompt (abbreviated)
C.7 Agentomics
Agentomics [24] is a multi-step ML agent that follows a fixed step sequence per iteration: iteration planning → data exploration → data split → data representation → model architecture → model training → model inference → prediction exploration → validation evaluation.
Prompt structure. The system prompt is assembled at runtime by prompt_builder.py:
Agentomics system prompt (abbreviated)
The per-iteration user prompt provides the instruction “Develop a machine learning model that generalizes well to new unseen data.”, workspace rules (writable directory, read-only previous iterations), and references to archived iteration outputs.
C.8 Per-Baseline Budget Mapping
Figure 3 plots performance against experiment budget B, but different baselines count their iterations differently. Table 5 shows how we convert each baseline’s native unit to B: one unit of B corresponds to one complete train-and-evaluate cycle on the target dataset, so the comparison is approximately compute-equalized across systems.
Table 5: Mapping from each baseline’s native iteration unit to budget B.

| System | One unit of $B$ corresponds to |
|---|---|
| Autoresearch | one revise-and-evaluate cycle |
| ShinkaEvolve | one mutation round |
| MLEvolve | one tree-search expansion + evaluation |
| AIRA-Dojo | one search step |
| Claude Code | one user-agent iteration |
| STELLA | one self-evolve round |
| Agentomics | one full ML iteration |
| Ours | one full training run on the target task |

## Appendix D Per-Task Scores for the TDC-ADMET Benchmark

Table 6 reports per-task results for all 22 TDC-ADMET benchmark datasets. Biomni scores are provided by the Biomni team from their beta platform with no standard deviation reported. All other systems were run by us under the unified evaluation protocol described in Appendix C.

Table 6: Per-task scores on the TDC-ADMET benchmark, reported as mean ± std over 5 seeds. The top two results are highlighted in bold and underlined text, respectively. Biomni scores are from the official beta release (single point, no std). (↑) / (↓) denotes that a larger / smaller number is better.

## Appendix E Per-Task Scores on the Polaris Benchmark

Table 7 reports per-task results on the 11 Polaris held-out tasks introduced in Section 4.2. Among agents with budgeted search ($B=20$), DrugSAGE ranks first on eight of eleven tasks. DrugSAGE-Zero achieves the top score overall on adme-fang-hclint and the second-best result on six of eleven tasks, illustrating that DrugSAGE-Zero maintains competitive performance at zero experiment budget.

Table 7: Per-task scores on the 11 Polaris held-out datasets, reported as mean ± std over 5 seeds. The top two results are highlighted in bold and underlined text, respectively. (↑) / (↓) denotes that a larger / smaller number is better.

## Appendix F Per-Task Budget Trajectories on the 17 Held-Out Tasks

Figure 3 in the main paper averages best-so-far performance over tasks within each evaluation set.
Figure 5 breaks this down and shows the best-so-far score on each individual held-out task as a function of experiment budget $B$.

Figure 5: Per-task budget trajectories on all 17 held-out tasks: 6 TDC-ADMET tasks (top row) and 11 Polaris tasks (bottom two rows).

## Appendix G Per-Task Zero-Test-Time Search Results

This section explains the routing decisions made by DrugSAGE-Zero on the 17 held-out tasks. For each task we report which task in $\mathcal{Z}$ the router matched it to, the solution it transferred, and the resulting zero-shot score (mean ± std over 5 seeds).

### G.1 TDC-ADMET Benchmark Held-Out Tasks

Table 8 reports, for each of the 6 ADMET target tasks, the analog task from $\mathcal{Z}$ that the router selected, the transferred solution, and its performance. Ames matches to BBB_Martins: both are binary classification tasks evaluated by AUROC, although their biological domains differ (mutagenicity vs. blood–brain barrier penetration). LD50 and Solubility both match to Lipophilicity_AstraZeneca, the largest MAE regression task in $\mathcal{Z}$, and inherit the same three-model ensemble. The three CYP inhibition tasks (CYP2C9, CYP2D6, CYP3A4 Veith) all match to CYP2C9_Substrate, despite the task-type difference (substrate vs. inhibition classification); the router generalizes across this distinction because the metric (AUPRC) and molecular domain (CYP enzyme) align. Across all three patterns, the matched task’s best solution transfers without modification, indicating that metric and task-type alignment in $\mathcal{Z}$ may be sufficient for effective zero-shot transfer even when the biological context does not fully overlap.

Table 8: Zero-shot routing results on the 6 TDC-ADMET held-out tasks, with performance reported as mean ± std over 5 seeds. Matched task: the task in $\mathcal{Z}$ the router selected.
### G.2 Polaris Benchmark Held-Out Tasks

Table 9 reports zero-shot routing results across the 11 Polaris held-out tasks. For the nine regression tasks, neither Pearson nor MSE appears in $\mathcal{Z}$, so the router cannot find an exact metric match and instead selects the most similar pool task by task type and molecular domain. The two PKIS2 kinase classification tasks both use AUPRC, which is present in $\mathcal{Z}$, and the router matches them directly to CYP substrate tasks on the basis of shared metric and task type, despite the domain gap between kinase inhibition and enzyme substrate activity.

Table 9: Zero-shot routing results on the 11 Polaris held-out tasks, with performance reported as mean ± std over 5 seeds. Matched task: the task in $\mathcal{Z}$ the router selected.

| Group | Polaris task | Metric | Matched task | Transferred solution | DrugSAGE-Zero |
|---|---|---|---|---|---|
| ADME-Fang regression | adme-fang-hclint | Pearson ↑ | Lipophilicity_AstraZeneca | minimol + chemprop-rdkit + chemprop | 0.7731 ± 0.0014 |
| ADME-Fang regression | adme-fang-rclint | Pearson ↑ | Lipophilicity_AstraZeneca | minimol + chemprop-rdkit + chemprop | 0.7820 ± 0.0012 |
| ADME-Fang regression | adme-fang-perm | Pearson ↑ | Lipophilicity_AstraZeneca | minimol + chemprop-rdkit + chemprop | 0.8618 ± 0.0047 |
| ADME-Fang regression | adme-fang-solu | Pearson ↑ | CYP2C9_Substrate_CarbonMangels | maplight-gnn + minimol + lantern-radr-ensemble | 0.7086 ± 0.0108 |
| ADME-Fang regression | adme-fang-hppb | Pearson ↑ | Lipophilicity_AstraZeneca | minimol + chemprop-rdkit + chemprop | 0.8192 ± 0.0072 |
| ADME-Fang regression | adme-fang-rppb | Pearson ↑ | Lipophilicity_AstraZeneca | minimol + chemprop-rdkit + chemprop | 0.8370 ± 0.0123 |
| PKIS2 kinase regression | pkis2-egfr-wt-reg | MSE ↓ | Lipophilicity_AstraZeneca | minimol + chemprop-rdkit + chemprop | 403.46 ± 10.13 |
| PKIS2 kinase regression | pkis2-kit-wt-reg | MSE ↓ | CYP2C9_Substrate_CarbonMangels | maplight-gnn + minimol + lantern-radr-ensemble | 786.63 ± 22.95 |
| PKIS2 kinase regression | pkis2-ret-wt-reg | MSE ↓ | CYP2C9_Substrate_CarbonMangels | maplight-gnn + minimol + lantern-radr-ensemble | 553.62 ± 27.85 |
| PKIS2 kinase classification | pkis2-kit-wt-cls | AUPRC ↑ | CYP2D6_Substrate_CarbonMangels | admetrix | 0.6873 ± 0.0174 |
| PKIS2 kinase classification | pkis2-ret-wt-cls | AUPRC ↑ | CYP2C9_Substrate_CarbonMangels | maplight-gnn + minimol + lantern-radr-ensemble | 0.8321 ± 0.0199 |

### G.3 Zero-Test-Time Routing Case Study

Figure 6 illustrates a concrete instance in which zero-test-time routing produces a better solution than budgeted search from scratch. The router matches the Solubility task to Lipophilicity_AstraZeneca, the largest MAE regression task in $\mathcal{Z}$, and transfers its best solution without modification. The transferred solution featurizes molecules with a wide 1900-dimensional concatenation of minimol, Morgan, RDKit, and MACCS descriptors and aggregates predictions from a 10-member homogeneous dense ensemble. In contrast, the search loop converges on a 4-architecture heterogeneous ensemble that combines minimol, AttentiveFP, Admetrix, and NovoExpert with LoRA-ResNet heads, achieving MAE = 0.722, compared with MAE = 0.686 for the routed solution (Figure 6). The gap suggests that the wide, homogeneous featurization strategy learned on Lipophilicity transfers more effectively to Solubility than the ensemble the agent independently discovers, a pattern consistent with the shared physicochemical nature of the two regression targets.
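
The structure of the transferred solution (wide concatenated features, then an average over a homogeneous ensemble) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the feature blocks are filled with random placeholder values instead of calling the real featurizers, the RDKit width of 200 is an assumption (Figure 6 gives it only as "~200"), and `make_head` is a hypothetical linear stand-in for the dense heads with skip connections.

```python
import random

# Feature-block widths as listed in Figure 6; the RDKit width is approximate there,
# so 200 is an assumption made for this sketch.
BLOCK_DIMS = {"minimol": 512, "morgan": 1024, "rdkit": 200, "maccs": 167}
N_MEMBERS = 10  # size of the homogeneous ensemble


def featurize(rng):
    """Concatenate placeholder feature blocks for one molecule.

    Random values stand in for the real featurizers, which are not called here.
    """
    x = []
    for dim in BLOCK_DIMS.values():
        x.extend(rng.gauss(0.0, 1.0) for _ in range(dim))
    return x


def make_head(in_dim, seed):
    """Hypothetical linear head; the paper's heads are dense networks with skips."""
    rng = random.Random(seed)
    weights = [rng.gauss(0.0, in_dim ** -0.5) for _ in range(in_dim)]
    return lambda x: sum(w * xi for w, xi in zip(weights, x))


def ensemble_predict(heads, x):
    """Average the member predictions: pred = mean(head_k(x) for all members k)."""
    return sum(head(x) for head in heads) / len(heads)


rng = random.Random(0)
x = featurize(rng)
heads = [make_head(len(x), seed) for seed in range(N_MEMBERS)]
pred = ensemble_predict(heads, x)
```

With these assumed widths the concatenated vector has 1903 entries, in line with the "1900-dimensional" description; every ensemble member sees the same input and differs only in its weights, which is what makes the ensemble homogeneous.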
[Figure 6 panels]

- Zero-shot routed (MAE ↓ = 0.686). Input (1900-d): x = concat(minimol(512), Morgan(1024), RDKit(~200), MACCS(167)). Head and ensemble: head = Dense(2048) × 4 + skip; pred = mean(head_k(x) for k in 1..10).
- Best self-searched (MAE ↓ = 0.722). Input (per architecture): minimol_comp.x = minimol(512); attentive_fp.x = graph; admetrix.x = own pipeline; novoexpert.x = own pipeline. Head and ensemble: head = LoRA-ResNet (rank 8, 3 layers); pred = mean(comp_i(x_i) for 4 archs).

Figure 6: Solubility case study. A zero-test-time solution routed from Lipophilicity outperforms the best solution DrugSAGE finds by searching Solubility from scratch. The routed solution combines wide multi-source features with a 10-member homogeneous ensemble; the from-scratch search converges on a 4-architecture heterogeneous ensemble.

## Appendix H Explore Agent Workflow Details

This appendix expands the skill-construction pipeline summarized in Section 3.1. DrugSAGE separates skill construction from the online optimization loop. The Explore Agent runs offline, before budgeted search on a target task, and converts task-relevant literature into executable skills that can be read by the downstream search system. Its output is a set of directories skills/{name}/, each containing a structured SKILL.md, an API snapshot, and repository metadata.

Phase 1: LiteratureScout. The Explore Agent begins from the task name, task description, and dataset metadata. A single query often retrieves only a narrow view of the relevant literature, so LiteratureScout asks an LLM to rewrite the task from multiple expert perspectives, such as molecular-property prediction, data modality, model architecture, training objective, domain-specific constraints, and implementation availability.
These perspective-specific queries are submitted to heterogeneous literature and code backends, including web search, paper indexes, preprint servers, and repository search. The returned pool is deduplicated and hard-filtered to remove methods that do not match the task, papers without usable implementations, and repositories that cannot be resolved. The remaining candidates are ranked by combining LLM relevance judgements with explicit implementation signals such as recency, repository activity, documentation quality, and the presence of training or inference examples. LiteratureScout writes the selected papers, repository links, and short selection rationales to task_memory.md and a machine-readable companion file.

Phase 2: SkillBuilder. SkillBuilder turns each selected paper–repository pair into an executable skill. It clones the referenced repository and constructs an abstract-syntax-tree (AST) snapshot: a structural parse of the source code that records importable modules, public functions and classes, model constructors, and nearby usage examples. The snapshot is stored with the skill as an implementation contract. Given the paper summary, repository metadata, and AST snapshot, the LLM writes a structured SKILL.md describing the task type, dependencies, callable entry points, and a minimal quick-start example. The prompt requires executable snippets to use only identifiers observed in the AST snapshot, which targets the main failure mode of LLM-written integration code: plausible but non-existent class names, methods, and import paths.

Tiered validation and repair. A generated skill is not admitted to the shared library immediately. DrugSAGE validates it through escalating checks. Tier 0 verifies the skill file itself, including syntax, package identity, dependency metadata, and import resolvability.
Tier 1 performs a dependency dry run of the quick-start section against the installed package, checking that referenced calls and objects are consistent with the repository snapshot. Tier 2 executes the quick-start end-to-end in a per-skill conda sandbox, catching dependency drift, missing data assumptions, and runtime errors. When a tier fails, SkillBuilder repairs the offending section using the AST snapshot, package metadata, and observed error message, then reruns validation. Only skills that pass all tiers are admitted to the library $\mathcal{K}$ and exposed to the online search loop through the tool catalog.

## Appendix I Formal Definition of Cross-Task Transfer Scores

This section provides the formal definition of the per-task standardized score underlying the cross-task transfer terms $\mathrm{transfer}(f)$ (Equation 1) and $\mathrm{transfer}(v_i)$ (Equation 3) introduced in Section 3.3.

### I.1 Per-Task Standardized Score

Since historical tasks use heterogeneous metrics whose raw values are not directly comparable, all historical scores are first converted to a bounded, normalized utility score before aggregation. Each node $v$ evaluated on a historical task $\tau$ with evaluation metric $\mu_\tau$ receives a raw score $x_\tau(v)$. We convert it to a standardized score $\bar{s}_\tau(v) \in [-1, 1]$ in three steps.

Step 1: Direction alignment. Convert all metrics to a higher-is-better orientation:

$$s_\tau(v) = \begin{cases} \phantom{-}x_\tau(v), & \text{if } \mu_\tau \text{ is higher-is-better}, \\ -x_\tau(v), & \text{if } \mu_\tau \text{ is lower-is-better}. \end{cases} \tag{5}$$

Step 2: Within-task robust normalization.
Normalize $s_\tau(v)$ relative to all completed nodes $V_\tau$ evaluated on task $\tau$, with the median as location and the median absolute deviation (MAD) as scale:

$$\tilde{s}_\tau(v) = \frac{s_\tau(v) - \operatorname{median}_{u \in V_\tau} s_\tau(u)}{\max\bigl(\operatorname{MAD}_{u \in V_\tau} s_\tau(u),\; \epsilon\bigr)}, \tag{6}$$

where $\operatorname{MAD}_{u \in V_\tau} s_\tau(u) \triangleq \operatorname{median}_{u \in V_\tau} \bigl| s_\tau(u) - \operatorname{median}_{u \in V_\tau} s_\tau(u) \bigr|$ and $\epsilon > 0$ is a small constant that prevents division by zero when all solutions achieve identical scores. The choice of median and MAD over mean and standard deviation makes the normalization robust to outlier solutions.

Step 3: Map to $[-1, 1]$. Apply the logistic sigmoid $\sigma$ to map the unbounded $\tilde{s}_\tau(v)$ to $[-1, 1]$:

$$\bar{s}_\tau(v) = 2\,\sigma\bigl(\tilde{s}_\tau(v)\bigr) - 1 \;\in\; [-1, 1]. \tag{7}$$

The monotone sigmoid maps any finite normalized score to the bounded interval, ensuring no single historical task can contribute an unbounded signal to the transfer prior.

### I.2 Task-Similarity Weights

Given a target task $\tau_0$, the weight $w(\tau, \tau_0) \ge 0$ measures the relevance of each historical task $\tau \in \mathcal{Z}$:

$$w(\tau, \tau_0) = w_{\text{metric}} \cdot w_{\text{type}} \cdot w_{\text{size}} \cdot w_{\text{emb}}, \tag{8}$$

where each factor captures one dimension of task similarity:

- Metric match. $w_{\text{metric}} = \mathbf{1}[\mu_\tau = \mu_{\tau_0}] + \delta$ rewards exact metric match with a small fallback $\delta > 0$ for metrics in the same family (e.g., both are correlation-based);
- Task match. $w_{\text{type}} = \mathbf{1}\bigl[\mathrm{type}(\tau) = \mathrm{type}(\tau_0)\bigr]$ matches the task type (classification vs. regression);
- Dataset size match. $w_{\text{size}} = \exp\bigl(-\gamma\,\bigl|\log|\mathcal{D}_\tau| - \log|\mathcal{D}_{\tau_0}|\bigr|\bigr)$ penalizes dataset-size mismatch on a log scale, with decay rate $\gamma > 0$;
- Description similarity. $w_{\text{emb}} = \max\bigl(\cos(\mathbf{e}_\tau, \mathbf{e}_{\tau_0}),\; 0\bigr)$ is the clamped cosine similarity between task-description embeddings $\mathbf{e}_\tau$ obtained from an LLM embedding API.

### I.3 Formal Definition of Transfer Scores

Using the standardized scores from Appendix I.1 and the weights from Appendix I.2, we define the transfer terms in Equations 1 and 3 as weighted averages over all historical tasks in $\mathcal{Z}$.

Family-level transfer. The transfer score for model family $f$ aggregates the standardized scores of all solutions in $f$ across historical tasks:

$$\mathrm{transfer}(f) = \frac{\sum_{\tau \in \mathcal{Z}} w(\tau, \tau_0) \cdot \operatorname{mean}_{v \in f,\, \tau}\; \bar{s}_\tau(v)}{\sum_{\tau \in \mathcal{Z}} w(\tau, \tau_0)}, \tag{9}$$

where the inner mean is over all solutions in family $f$ evaluated on task $\tau$.

Node-level transfer.
For an individual solution $v_i$, we search for the solution in each historical task $\tau$ that shares the same model family and modification type as $v_i$, denoted $\mathrm{matched}(v_i, \tau)$, and aggregate its standardized score:

$$\mathrm{transfer}(v_i) = \frac{\sum_{\tau \in \mathcal{Z}} w(\tau, \tau_0) \cdot \bar{s}_\tau\bigl(\mathrm{matched}(v_i, \tau)\bigr)}{\sum_{\tau \in \mathcal{Z}} w(\tau, \tau_0)}. \tag{10}$$

### I.4 Boundedness Guarantee

Since $\bar{s}_\tau(v) \in [-1, 1]$ for all $v$ and $\tau$ by construction (Equation 7), and all weights satisfy $w(\tau, \tau_0) \ge 0$, both transfer scores are weighted averages of values in $[-1, 1]$ and therefore satisfy $|\mathrm{transfer}(f)| \le 1$ and $|\mathrm{transfer}(v_i)| \le 1$. This directly satisfies the bounded-transfer assumption $|\mathrm{transfer}(f)| \le C_0$ in Assumption J.1 of Appendix J with $C_0 = 1$, ensuring that the cross-task prior cannot introduce unbounded bias into the UCB selection policy and that the regret order of Theorem J.5 is preserved.

## Appendix J Proof That Adding the Cross-Task Causal-Factor Term Does Not Change the Order of the Regret Upper Bound

### J.1 Problem Setup

We prove that the transfer bonus terms, $\mathrm{transfer}(f)$ in the family-level selector and $w_v$ in the node-level selector, do not break the sublinear regret guarantee of the underlying UCB1 policy. In particular, the asymptotic order of the regret bound remains identical to that of the standard UCB1 algorithm, and the causal terms affect only the leading constant. Therefore, introducing cross-task memory does not change the asymptotic order of the upper bound.
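
Before the assumptions are stated, it may help to see the Appendix I quantities in executable form. The following is a minimal Python sketch of Equations (5) to (9) under stated assumptions, not the authors' code: the dictionary layout, the default values `delta=0.1` and `gamma=0.5`, the two-dimensional embeddings, and all numbers in the toy history are hypothetical. The metric factor is implemented literally as the indicator plus $\delta$, and returning 0 when all weights vanish is an added convention that the text does not specify.

```python
import math
from statistics import median


def standardized_score(x_v, all_raw, higher_is_better, eps=1e-8):
    """Equations (5)-(7): align direction, normalize by median/MAD, squash."""
    sign = 1.0 if higher_is_better else -1.0
    s_v = sign * x_v
    s_all = [sign * x for x in all_raw]
    med = median(s_all)
    mad = median(abs(s - med) for s in s_all)
    s_tilde = (s_v - med) / max(mad, eps)
    # tanh(z / 2) equals 2 * sigmoid(z) - 1 and does not overflow for large |z|.
    return math.tanh(s_tilde / 2.0)


def similarity_weight(task, target, delta=0.1, gamma=0.5):
    """Equation (8): product of metric, type, size, and embedding factors."""
    w_metric = (1.0 if task["metric"] == target["metric"] else 0.0) + delta
    w_type = 1.0 if task["type"] == target["type"] else 0.0
    w_size = math.exp(-gamma * abs(math.log(task["size"]) - math.log(target["size"])))
    dot = sum(a * b for a, b in zip(task["emb"], target["emb"]))
    norm = math.hypot(*task["emb"]) * math.hypot(*target["emb"])
    w_emb = max(dot / norm, 0.0)  # clamped cosine similarity
    return w_metric * w_type * w_size * w_emb


def family_transfer(history, target):
    """Equation (9): similarity-weighted average of per-task family means."""
    num = den = 0.0
    for task in history:
        w = similarity_weight(task, target)
        num += w * sum(task["family_scores"]) / len(task["family_scores"])
        den += w
    # Assumption of this sketch: report 0 when no historical task is relevant.
    return num / den if den > 0 else 0.0


# Toy history with made-up numbers; the second task has a different task type,
# so its weight is zero and it does not influence the result.
target = {"metric": "MAE", "type": "regression", "size": 10000, "emb": [1.0, 0.0]}
history = [
    {"metric": "MAE", "type": "regression", "size": 4200, "emb": [0.8, 0.6],
     "family_scores": [standardized_score(0.47, [0.47, 0.52, 0.61], False)]},
    {"metric": "AUPRC", "type": "classification", "size": 700, "emb": [0.0, 1.0],
     "family_scores": [standardized_score(0.70, [0.70, 0.65, 0.60], True)]},
]
score = family_transfer(history, target)
```

Because each standardized score is a value of `tanh` and the weights are non-negative, `score` always lies in $[-1, 1]$, which is the boundedness property that Assumption J.1 relies on with $C_0 = 1$.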
#### J.1.1 Assumptions

Assumption J.1 (Boundedness). Rewards satisfy $r \in [0, 1]$. The clipped causal bonus terms are uniformly bounded:

$$|\mathrm{transfer}(f)| \le C_0 < \infty \quad \forall\, f \in \mathcal{F}.$$

Assumption J.2 (Asymptotic Consistency). The CEG estimator is consistent: as $n(f) \to \infty$,

$$\mathrm{transfer}(f) \to \Delta^*(f),$$

where $\Delta^*(f)$ is the ground-truth cross-task transfer value for family $f$. More precisely, we assume that the estimation error satisfies

$$\bigl|\mathrm{transfer}(f) - \Delta^*(f)\bigr| \le \frac{C_0}{\sqrt{n(f)}}$$

for some constant $C_0 > 0$.

Assumption J.3 (Optimism). The CEG bonus is (asymptotically) optimistic, consistent with the normalized score designed in UCB:

$$\mathrm{transfer}(f) \ge \Delta^*(f) - \epsilon_t, \quad \epsilon_t \to 0 \text{ as } t \to \infty.$$

### J.2 Family-Level Selection

#### J.2.1 Setup and Regret Definition

The family-level UCB selection rule is

$$f_t = \operatorname*{arg\,max}_{f \in \mathcal{F}} \; \tilde{r}(f) + \alpha \sqrt{\frac{\ln(t+1)}{n(f)}} + \mathrm{transfer}(f). \tag{11}$$

Let $f^* = \operatorname*{arg\,max}_f \, \mathbb{E}[\tilde{r}(f) + \Delta^*(f)]$ be the optimal family under the adjusted reward. Define the sub-optimality gap

$$\Delta_f = \bigl(\tilde{r}(f^*) + \Delta^*(f^*)\bigr) - \bigl(\tilde{r}(f) + \Delta^*(f)\bigr) > 0 \quad \text{for } f \ne f^*.$$
The $T$-round cumulative regret is

$$R_T = \sum_{t=1}^{T} \Bigl[ \bigl(\tilde{r}(f^*) + \Delta^*(f^*)\bigr) - \bigl(\tilde{r}(f_t) + \Delta^*(f_t)\bigr) \Bigr].$$

#### J.2.2 Lemma: Causal Bonus Is Order-Compatible with UCB Exploration

Lemma J.4 (Order Compatibility). Under Assumption J.2, decompose the CEG estimator as

$$\mathrm{transfer}(f) = \Delta^*(f) + \xi_t(f), \quad |\xi_t(f)| \le \frac{C_0}{\sqrt{n(f)}}.$$

Then the augmented UCB index (11) can be written as

$$\mathrm{UCB}_t(f) = \underbrace{\tilde{r}(f) + \Delta^*(f)}_{\text{adjusted true value}} + \underbrace{\alpha \sqrt{\frac{\ln(t+1)}{n(f)}} + \xi_t(f)}_{\text{confidence term}}, \tag{12}$$

where the confidence term satisfies

$$\left| \alpha \sqrt{\frac{\ln(t+1)}{n(f)}} + \xi_t(f) \right| \le \frac{\alpha \sqrt{\ln(t+1)} + C_0}{\sqrt{n(f)}} = O\!\left( \sqrt{\frac{\ln t}{n(f)}} \right).$$

Hence the causal bonus introduces no new asymptotic order.

Proof. The decomposition follows directly from Assumption J.2. The bound on the confidence term follows from the triangle inequality:

$$\left| \alpha \sqrt{\frac{\ln(t+1)}{n(f)}} + \xi_t(f) \right| \le \alpha \sqrt{\frac{\ln(t+1)}{n(f)}} + |\xi_t(f)| \le \frac{\alpha \sqrt{\ln(t+1)} + C_0}{\sqrt{n(f)}}.$$
Since $\alpha \sqrt{\ln(t+1)} + C_0 = O(\sqrt{\ln t})$, the entire confidence term is $O\bigl(\sqrt{\ln t / n(f)}\bigr)$, matching the standard UCB rate. ∎

#### J.2.3 Theorem: Family-Level Regret Bound

Theorem J.5 (Family-Level Regret Upper Bound). Under Assumptions J.1–J.3, the augmented UCB policy (11) achieves cumulative regret

$$R_T \le \sum_{\substack{f \in \mathcal{F} \\ \Delta_f > 0}} \left( \frac{8(\alpha^2 + C_0^2) \ln T}{\Delta_f} + \left(1 + \frac{\pi^2}{3}\right) \Delta_f \right). \tag{13}$$

This is $O(|\mathcal{F}| \ln T)$, identical in asymptotic order to standard UCB; the causal factor only modifies the leading constant via $C_0$.

Proof. We follow the standard UCB analysis and carefully track the causal error term $\xi_t(f)$.

Step 1: High-probability bound on the optimal arm. By Hoeffding’s inequality [14], with probability at least $1 - t^{-4}$,

$$\tilde{r}(f^*) \ge \hat{r}_{n(f^*)}(f^*) - \sqrt{\frac{2 \ln t}{n(f^*)}}.$$

Therefore, using Assumption J.2 for the CEG term,

$$\mathrm{UCB}_t(f^*) \ge \tilde{r}(f^*) + \Delta^*(f^*) - \frac{C_0}{\sqrt{n(f^*)}}. \tag{14}$$

Step 2: Necessary condition for selecting a suboptimal arm. Arm $f \ne f^*$ is selected at round $t$ only if $\mathrm{UCB}_t(f) \ge \mathrm{UCB}_t(f^*)$.
Combining with (14) and using the decomposition (12):

$$\tilde{r}(f) + \Delta^*(f) + \alpha \sqrt{\frac{\ln(t+1)}{n(f)}} + \frac{C_0}{\sqrt{n(f)}} \ge \tilde{r}(f^*) + \Delta^*(f^*).$$

Rearranging:

$$\left( \alpha + \frac{C_0}{\sqrt{\ln(t+1)}} \right) \sqrt{\frac{\ln(t+1)}{n(f)}} \ge \Delta_f. \tag{15}$$

Step 3: Upper bound on arm pull count. Squaring both sides of (15) and solving for $n(f)$:

$$n(f) \le \frac{\bigl(\alpha + C_0 / \sqrt{\ln(t+1)}\bigr)^2 \ln(t+1)}{\Delta_f^2} \le \frac{2(\alpha^2 + C_0^2) \ln T}{\Delta_f^2},$$

where the last step uses $\bigl(\alpha + C_0 / \sqrt{\ln(t+1)}\bigr)^2 \le 2(\alpha^2 + C_0^2)$, which follows from the AM–GM inequality $(a + b)^2 \le 2(a^2 + b^2)$ whenever $\ln(t+1) \ge 1$, together with $\ln(t+1) \le \ln T$ for $t + 1 \le T$.

Step 4: Expected number of pulls. Using the standard UCB analysis (tail-sum decomposition),

$$\mathbb{E}[N_T(f)] \le \frac{8(\alpha^2 + C_0^2) \ln T}{\Delta_f^2} + 1 + \frac{\pi^2}{3}.$$

Step 5: Cumulative regret. Summing over all suboptimal families:

$$R_T = \sum_{f:\, \Delta_f > 0} \Delta_f \cdot \mathbb{E}[N_T(f)] \le \sum_{f:\, \Delta_f > 0} \left( \frac{8(\alpha^2 + C_0^2) \ln T}{\Delta_f} + \left(1 + \frac{\pi^2}{3}\right) \Delta_f \right).$$

Key observation. The bound is $O(\ln T)$, matching standard UCB. The causal factor contributes only through $C_0$, which modifies the constant but not the asymptotic order.
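
The selection rule in Equation (11), whose regret is bounded above, can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: the function name `select_family`, the dictionary inputs, the default `alpha=1.0`, and the rule that unvisited families are tried first are all assumptions, and the family names and numbers in the example are made up.

```python
import math


def select_family(mean_reward, pulls, transfer, t, alpha=1.0):
    """Pick the family maximizing mean reward + exploration bonus + transfer prior."""
    def index(family):
        if pulls[family] == 0:
            # Assumption of this sketch: families with no pulls are explored first.
            return math.inf
        bonus = alpha * math.sqrt(math.log(t + 1) / pulls[family])
        return mean_reward[family] + bonus + transfer[family]

    return max(mean_reward, key=index)


# Hypothetical statistics for two model families after 10 rounds.
mean_reward = {"family_a": 0.60, "family_b": 0.58}
pulls = {"family_a": 5, "family_b": 5}

# Without a prior, the family with the higher empirical mean is chosen.
no_prior = select_family(mean_reward, pulls, {"family_a": 0.0, "family_b": 0.0}, t=10)

# A bounded transfer prior in [-1, 1] can change the choice between close candidates.
with_prior = select_family(mean_reward, pulls, {"family_a": -0.2, "family_b": 0.3}, t=10)
```

In this toy setting both families have the same exploration bonus, so the prior alone decides between them. As pull counts grow, the empirical mean and the shrinking bonus govern the index alongside the bounded prior, which is the intuition behind the unchanged $O(\ln T)$ order.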
If we also consider the node-level regret bound, under the weighted sampler computation in Equation 3, the bonus $\max(\Delta^{\mathrm{CEG}}(v_i), 0)$ serves as a non-negative multiplicative factor determined by $w_i$. It therefore only increases the sampling probability for nodes with a positive CEG signal, and leaves the original node-level UCB asymptotic regret order $O(\sqrt{T \ln T})$ unchanged while potentially improving the leading constant. ∎
`