Active-GRPO: Adaptive Imitation and Self-Improving Reasoning for Molecular Optimization
Summary
Active-GRPO introduces an adaptive imitation and self-improving reasoning framework that dynamically decides when to imitate references and when to reinforce the model's own discoveries for molecular optimization, achieving statistically significant improvements over previous methods on the TOMG-Bench-MolOpt benchmark.
# Active-GRPO: Adaptive Imitation and Self-Improving Reasoning for Molecular Optimization

Source: [https://arxiv.org/html/2607.00531](https://arxiv.org/html/2607.00531)

Xuefeng Liu¹, Mingxuan Cao²\*, Qinan Huang³, Thomas Brettin⁵, Rick L. Stevens⁴,⁵, Le Cong¹

¹School of Medicine, Stanford University; ²Data Science Institute, University of Chicago; ³Pritzker School of Molecular Engineering, University of Chicago; ⁴Department of Computer Science, University of Chicago; ⁵Argonne National Laboratory

\*Equal contribution. Correspondence to: Xuefeng Liu <xfl@stanford.edu>, Mingxuan Cao <mcao@uchicago.edu>

###### Abstract

Scientific reasoning is an increasingly important capability of large language models, yet improving the robustness and efficiency of training such reasoning remains a key open challenge. We study this problem in instruction-based molecular optimization, where answer-only supervised fine-tuning (SFT) collapses multi-step reasoning and reinforcement learning with verifiable rewards (RLVR) suffers from sparse feedback. Reference-guided Policy Optimization (RePO) mitigates both by anchoring policy updates to dataset-provided references, but its effectiveness is tightly coupled to reference quality: weak or misaligned references impose a performance ceiling. To overcome this ceiling, we propose active reasoning, a paradigm in which the policy actively decides, on a per-instance basis, *when* to imitate a reference and *when* to reinforce its own discoveries, while continuously upgrading *what* it imitates. We instantiate this paradigm as Active Group Relative Policy Optimization (Active-GRPO), realized through two coupled mechanisms: *active imitate-reinforce* and *active referencing*.
The former performs imitation learning when the reference still outperforms the policy’s own candidates, and shifts to self-improvement via reinforcement learning once the policy has generated molecules that surpass the reference. The latter continuously upgrades the reference itself by replacing it with the best policy-generated candidate discovered so far, progressively raising the imitation target and ensuring that reference guidance remains informative, rather than restrictive, throughout training. Across TOMG-Bench MolOpt, Active-GRPO improves average SR×Sim from 0.0959 for GRPO and 0.1665 for RePO to 0.1773 under matched three-seed evaluation, with statistically significant gains on LogP, MR, and QED.

## 1 Introduction

Large language models (LLMs) have rapidly emerged as general-purpose reasoning engines, demonstrating strong performance on tasks that demand multi-step deliberation rather than surface pattern matching \[[17](https://arxiv.org/html/2607.00531#bib.bib17), [38](https://arxiv.org/html/2607.00531#bib.bib38)\]. Through advances in chain-of-thought prompting \[[52](https://arxiv.org/html/2607.00531#bib.bib52)\], supervised fine-tuning (SFT) on reasoning traces \[[36](https://arxiv.org/html/2607.00531#bib.bib36), [56](https://arxiv.org/html/2607.00531#bib.bib56)\], and reinforcement learning with verifiable rewards (RLVR) \[[19](https://arxiv.org/html/2607.00531#bib.bib19), [28](https://arxiv.org/html/2607.00531#bib.bib28)\], modern LLMs can now solve competition-level mathematics \[[11](https://arxiv.org/html/2607.00531#bib.bib11), [20](https://arxiv.org/html/2607.00531#bib.bib20)\], write and debug complex code \[[9](https://arxiv.org/html/2607.00531#bib.bib9), [24](https://arxiv.org/html/2607.00531#bib.bib24)\], and conduct structured analyses across diverse domains.
This progress has motivated a growing line of work that brings LLM reasoning to bear on scientific discovery\[[1](https://arxiv.org/html/2607.00531#bib.bib1),[51](https://arxiv.org/html/2607.00531#bib.bib51)\], where success often hinges on navigating combinatorially large hypothesis spaces under domain\-specific constraints\. From hypothesis generation and experimental design to candidate screening in chemistry, biology, and materials science\[[7](https://arxiv.org/html/2607.00531#bib.bib7),[23](https://arxiv.org/html/2607.00531#bib.bib23),[49](https://arxiv.org/html/2607.00531#bib.bib49)\], LLMs are increasingly positioned not as passive question answerers but as active reasoners that propose, evaluate, and refine scientific artifacts\. Yet making such reasoning*robust*and*sample\-efficient*to train remains a central open challenge—particularly in scientific domains where outputs must satisfy strict, programmatically verifiable constraints\. Among these scientific reasoning tasks, instruction\-based molecular optimization has emerged as a particularly demanding testbed\[[29](https://arxiv.org/html/2607.00531#bib.bib29),[31](https://arxiv.org/html/2607.00531#bib.bib31)\]\. Given a source molecule and a natural\-language instruction specifying desired property changes—for example, improving aqueous solubility while preserving binding affinity—the model must propose a structurally similar yet property\-improved candidate\[[25](https://arxiv.org/html/2607.00531#bib.bib25),[26](https://arxiv.org/html/2607.00531#bib.bib26)\]\. This task lies at the heart of drug discovery\[[34](https://arxiv.org/html/2607.00531#bib.bib34),[49](https://arxiv.org/html/2607.00531#bib.bib49)\], agrochemical design\[[12](https://arxiv.org/html/2607.00531#bib.bib12)\], and materials development\[[7](https://arxiv.org/html/2607.00531#bib.bib7),[45](https://arxiv.org/html/2607.00531#bib.bib45)\], while imposing tightly coupled constraints\. 
Outputs must be syntactically valid molecules \[[53](https://arxiv.org/html/2607.00531#bib.bib53)\], retain a high degree of structural similarity to the input scaffold \[[4](https://arxiv.org/html/2607.00531#bib.bib4), [6](https://arxiv.org/html/2607.00531#bib.bib6)\], achieve measurable improvements across one or more (often competing) property objectives \[[8](https://arxiv.org/html/2607.00531#bib.bib8), [22](https://arxiv.org/html/2607.00531#bib.bib22)\], and faithfully follow the user’s instruction \[[14](https://arxiv.org/html/2607.00531#bib.bib14), [30](https://arxiv.org/html/2607.00531#bib.bib30)\]. Unlike open-ended generation, molecular optimization therefore requires constrained, multi-objective reasoning over structured chemical objects, with every candidate verifiable against programmatic property predictors and similarity metrics \[[15](https://arxiv.org/html/2607.00531#bib.bib15), [41](https://arxiv.org/html/2607.00531#bib.bib41)\].

Existing training paradigms for this setting exhibit characteristic failure modes. Answer-only SFT \[[39](https://arxiv.org/html/2607.00531#bib.bib39), [50](https://arxiv.org/html/2607.00531#bib.bib50)\] forces the model to memorize input–output mappings without articulating chemical rationale, collapsing multi-step reasoning \[[10](https://arxiv.org/html/2607.00531#bib.bib10), [33](https://arxiv.org/html/2607.00531#bib.bib33)\] and limiting generalization to unseen instruction styles. RLVR \[[19](https://arxiv.org/html/2607.00531#bib.bib19), [28](https://arxiv.org/html/2607.00531#bib.bib28)\], which optimizes directly against programmatic property checkers, in principle preserves reasoning, but in practice suffers from sparse feedback \[[2](https://arxiv.org/html/2607.00531#bib.bib2), [43](https://arxiv.org/html/2607.00531#bib.bib43)\]: under tight similarity constraints, most sampled molecules fail validity or similarity gates and receive zero reward, starving the policy of learning signal.
Reference-guided Policy Optimization (RePO) \[[32](https://arxiv.org/html/2607.00531#bib.bib32)\] mitigates both pathologies by anchoring policy updates to dataset-provided reference molecules \[[25](https://arxiv.org/html/2607.00531#bib.bib25), [29](https://arxiv.org/html/2607.00531#bib.bib29)\], blending imitation and reward-based learning to densify the training signal and inherit the stability benefits of demonstration-based learning \[[21](https://arxiv.org/html/2607.00531#bib.bib21), [37](https://arxiv.org/html/2607.00531#bib.bib37), [40](https://arxiv.org/html/2607.00531#bib.bib40), [42](https://arxiv.org/html/2607.00531#bib.bib42)\]. However, RePO’s effectiveness is tied to the static quality of its references. When references are weak, noisy, or misaligned with the instruction \[[5](https://arxiv.org/html/2607.00531#bib.bib5), [16](https://arxiv.org/html/2607.00531#bib.bib16)\], the imitation signal actively pulls the policy away from better solutions it might otherwise discover, creating a performance ceiling bounded by the dataset rather than by the policy’s true capability.

To overcome this ceiling, we propose active reasoning, a training paradigm where *active* refers to actively deciding when to imitate a reference, when to reinforce its own discoveries, and what target to imitate; and *reasoning* refers to the deliberative `<think>…</think><answer>…</answer>` generation. We instantiate this paradigm as Active Group Relative Policy Optimization (Active-GRPO), which couples active reasoning with two mechanisms: *active imitate-reinforce* and *active referencing*. The active imitate-reinforce mechanism performs imitation learning when the reference still outperforms the policy’s own candidates, and shifts to self-improvement via reinforcement learning once the policy has generated molecules that surpass the reference.
The active referencing mechanism continuously upgrades the reference itself by replacing it with the best policy-generated candidate discovered so far, progressively raising the imitation target as training proceeds. Together, these mechanisms ensure that reference guidance remains *informative* rather than *restrictive*, transitioning the policy from learning *from* references to learning *beyond* them. By construction, this makes reference guidance robust across the spectrum of reference quality.

Figure 1: Conceptual motivation for Active-GRPO. RePO continues to imitate a fixed reference, which can become stale as the policy improves. Active-GRPO instead adapts both imitation strength and the guidance target, enabling a transition from reference imitation to active self-improvement.

We evaluate Active-GRPO across a suite of molecular optimization benchmarks spanning diverse property objectives, instruction styles, and reference-quality regimes. Our contributions are threefold:

- We identify and formally characterize the *static-reference ceiling* in reference-guided policy optimization, showing that fixed dataset references can systematically mislead training when they fall below the policy’s own capability.
- We introduce *active reasoning* as a paradigm for reference-guided training, and instantiate it as Active-GRPO, which couples active imitate-reinforce and active referencing to make reference guidance robust to reference quality and self-improving over time.
- We show empirically that Active-GRPO consistently outperforms RePO and GRPO baselines, delivers more robust optimization across varying reference-quality regimes, and achieves a better balance across competing chemical objectives, establishing adaptive reference guidance as a principled path beyond the limits of static supervision.
## 2 Preliminaries

### 2.1 Problem Formulation: Instruction-Conditioned Molecular Generation

We study instruction-conditioned molecular generation with reference guidance. The model receives a natural-language specification with task-dependent molecular context (typically an input molecule and optimization constraints), and produces a candidate molecule SMILES \[[53](https://arxiv.org/html/2607.00531#bib.bib53)\] satisfying the specification. SMILES is a string representation of molecular graphs widely used in chemical language modeling and cheminformatics. Each training instance pairs a conditioning context $c_i$ with a dataset-provided reference molecule $m_{\mathrm{ref},i}$ providing answer-level guidance during training.

##### Data and prompts.

Let $\mathcal{D}=\{z_i\}_{i=1}^{N}$ be a training set, where each instance is $z_i=(c_i, m_{\mathrm{ref},i})$. Here $c_i$ is a task-dependent conditioning context, and $m_{\mathrm{ref},i}$ is a dataset-provided reference molecule. The two play different roles: $c_i$ specifies the optimization problem and conditions reward evaluation, whereas $m_{\mathrm{ref},i}$ serves as an answer-level guidance target during training.

##### Reasoning-augmented generation.

The policy $\pi_{\theta}$ is trained to produce a structured output that interleaves a reasoning trace with a final answer:

$$o=\texttt{<think>}\;\tau\;\texttt{</think>}\;\texttt{<answer>}\;\widehat{m}\;\texttt{</answer>}.$$

Here $\tau$ is a free-form natural-language reasoning trace and $\widehat{m}$ is the candidate molecule SMILES.
This format follows recent reasoning-trained LLMs \[[19](https://arxiv.org/html/2607.00531#bib.bib19)\] and is well suited to molecular optimization: the trace gives the model space to identify editable substructures, weigh modifications, and check constraints before committing to a final molecule. We do not supervise $\tau$ directly; only the final answer span carries explicit answer-level guidance (Section [2.2](https://arxiv.org/html/2607.00531#S2.SS2)).

##### Reward.

We assume a verifiable reward $R(\widehat{m};c)$ defined on (candidate, context) pairs, with invalid or constraint-violating molecules receiving zero reward. In MolOpt, $R$ combines the requested property improvement with structural preservation; exact components are task-dependent. We also define the *reference reward* $v_{\mathrm{ref}}(z_i)=R(m_{\mathrm{ref},i};c_i)$, used as a per-instance baseline and as the anchor against which the policy’s best candidates are compared in our method.

### 2.2 GRPO-based Reasoning Optimization

##### GRPO.

Group Relative Policy Optimization (GRPO) \[[47](https://arxiv.org/html/2607.00531#bib.bib47)\] is an actor-only variant of PPO \[[46](https://arxiv.org/html/2607.00531#bib.bib46)\] that replaces the learned value baseline with within-group reward normalization.
For each prompt $x_i$, GRPO samples $G$ rollouts $\{o_{i,j}\}_{j=1}^{G}$ from the old policy $\pi_{\theta_{\mathrm{old}}}$, extracts candidate molecules $\{\widehat{m}_{i,j}\}_{j=1}^{G}$, and evaluates rewards

$$r_{i,j}=R(\widehat{m}_{i,j};c_i),\qquad \bar{r}_i=\frac{1}{G}\sum_{j=1}^{G}r_{i,j}.$$

GRPO then forms the group-normalized advantage $\widehat{A}_{i,j}=\frac{r_{i,j}-\bar{r}_i}{\sigma_{r,i}+\varepsilon}$, where $\sigma_{r,i}$ is the within-group reward standard deviation. At the objective level, GRPO can be written as

$$\mathcal{J}_{\mathrm{GRPO}}(\theta)=\mathbb{E}_{\begin{subarray}{c}x_i\sim\mathcal{D},\\ \{o_{i,j}\}_{j=1}^{G}\sim\pi_{\theta_{\mathrm{old}}}(\cdot\mid x_i)\end{subarray}}\Bigg[\frac{1}{G}\sum_{j=1}^{G}\frac{1}{|o_{i,j}|}\sum_{t=1}^{|o_{i,j}|}\Big(\min\!\Big[\rho_{i,j,t}(\theta)\widehat{A}_{i,j},\,\mathrm{clip}\!\bigl(\rho_{i,j,t}(\theta),1-\epsilon,1+\epsilon\bigr)\widehat{A}_{i,j}\Big]-\beta_{\mathrm{KL}}\,D^{\mathrm{KL}}_{i,j,t}\Big)\Bigg],\tag{1}$$

where

$$\rho_{i,j,t}(\theta)=\frac{\pi_{\theta}(o_{i,j,t}\mid x_i,o_{i,j,<t})}{\pi_{\theta_{\mathrm{old}}}(o_{i,j,t}\mid x_i,o_{i,j,<t})}.$$

In implementation, we optimize the negative empirical counterpart of this objective.
For clarity, we denote the resulting minibatch loss by

$$\mathcal{L}_{\mathrm{RL}}=-\frac{1}{\sum_{i,j,t}m^{\mathrm{cmp}}_{i,j,t}}\sum_{i,j,t}\Big[\rho_{i,j,t}\widehat{A}_{i,j}-\beta_{\mathrm{KL}}D^{\mathrm{KL}}_{i,j,t}\Big]m^{\mathrm{cmp}}_{i,j,t},$$

where $m^{\mathrm{cmp}}_{i,j,t}$ is the completion mask, $\rho_{i,j,t}$ is the per-token policy ratio, and $D^{\mathrm{KL}}_{i,j,t}$ is the per-token KL penalty against the frozen reference policy $\pi_{\mathrm{ref}}$. Because GRPO learns from within-prompt relative reward differences, its signal can weaken when all sampled rollouts for a prompt receive similar rewards. This motivates adding answer-level guidance in reference-guided molecular optimization.

##### GRPO with reference-molecule guidance.

RePO \[[32](https://arxiv.org/html/2607.00531#bib.bib32)\] augments GRPO with an answer-level imitation loss that pulls the policy toward the dataset reference molecule, adding supervised signal to relative reward optimization. Given a target molecule $m$, RePO defines

$$\mathcal{L}_{\mathrm{guide}}^{(i)}(m)=-\frac{\sum_{t}\log\pi_{\theta}(m_t\mid x_i,m_{<t})\,m^{\mathrm{ans}}_{i,t}}{\sum_{t}m^{\mathrm{ans}}_{i,t}+\varepsilon},$$

where $m^{\mathrm{ans}}_{i,t}$ masks only the answer span; the reasoning trace $\tau$ is not supervised. In vanilla RePO, the guidance target is fixed to the dataset reference, $m_i^{*}=m_{\mathrm{ref},i}$, and the objective is

$$\mathcal{L}_{\mathrm{RePO}}=\mathcal{L}_{\mathrm{RL}}+\frac{1}{B}\sum_{i=1}^{B}\mathcal{L}_{\mathrm{guide}}^{(i)}(m_{\mathrm{ref},i}).$$
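
The two ingredients defined in this subsection, the group-normalized advantage and the answer-span guidance loss, can be illustrated with a minimal plain-Python sketch. It is not the authors' implementation: the function names, the default `eps` values, and the use of the population standard deviation are assumptions made here for illustration, and per-token log-probabilities are passed in as plain lists rather than computed by a model.

```python
import math


def group_normalized_advantages(rewards, eps=1e-6):
    """Return (r_j - mean) / (std + eps) for one prompt's rollout group."""
    g = len(rewards)
    mean = sum(rewards) / g
    # Assumption: population standard deviation; the text does not say
    # which estimator is used for the within-group deviation.
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / g)
    return [(r - mean) / (std + eps) for r in rewards]


def guidance_loss(token_logprobs, answer_mask, eps=1e-8):
    """Masked mean negative log-likelihood over the answer span only."""
    masked_sum = sum(lp * m for lp, m in zip(token_logprobs, answer_mask))
    return -masked_sum / (sum(answer_mask) + eps)


# A group in which every rollout fails the reward gates carries no signal:
# all advantages are zero, which is the sparse-feedback case described above.
print(group_normalized_advantages([0.0, 0.0, 0.0, 0.0]))

# One successful rollout receives a positive advantage, the rest negative.
print(group_normalized_advantages([0.0, 0.0, 0.5, 0.0]))

# Only the last two tokens belong to the answer span (mask = 1).
print(guidance_loss([-1.0, -2.0, -3.0], [0, 1, 1]))
```

The all-zero group makes the motivation for answer-level guidance concrete: when the advantages vanish, only the guidance term contributes a gradient.
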
##### Limitation: The Static-Reference Ceiling.

RePO’s design tacitly assumes that the dataset reference is consistently a useful target. This assumption breaks down in two practically common regimes. *(i) Reference saturation:* once the policy starts generating molecules whose reward matches or exceeds $v_{\mathrm{ref}}(z_i)$, continuing to imitate $m_{\mathrm{ref},i}$ pulls the policy back toward a strictly worse target. *(ii) Weak references:* when the dataset reference is itself far from optimal (for example, when references are automatically curated or noisy), guidance toward $m_{\mathrm{ref},i}$ caps achievable performance well below what the policy could otherwise reach. In both regimes, two design choices that fixed-reference guidance cannot make become first-class degrees of freedom: *when* to imitate at all, and *what* to imitate. Our method (Section [3](https://arxiv.org/html/2607.00531#S3)) makes both choices adaptive and per-instance.

## 3 Algorithm: Active-GRPO

We now instantiate the active reasoning paradigm (letting the policy decide, per instance, when to imitate a reference and when to reinforce its own discoveries, while continuously upgrading what it imitates) as Active Group Relative Policy Optimization (Active-GRPO). Active-GRPO augments the RePO objective with two coupled mechanisms:

- Active imitate-reinforce decides *when* to imitate via a smooth, context-dependent guidance weight that compares the policy’s current best samples against the reference.
- Active referencing decides *what* to imitate via a per-instance memory bank that promotes policy-generated candidates once they outperform the dataset reference.

Figure 2: Overview of Active-GRPO.
The method augments reference-guided policy optimization with two active decisions: when to imitate, through a dynamic guidance weight, and what to imitate, through active reference selection over promoted policy-discovered candidates.

##### Active Imitate-Reinforce.

We adaptively blend imitation and reinforcement learning to achieve robust policy improvement, allowing the learner to switch between imitating an oracle and improving via RL based on online relative performance. In our setting, the dataset reference serves as the initial oracle, and policy-generated candidates can themselves become improved oracles once they surpass it. Rather than enforcing a hard switch, we implement this idea as a smooth, per-instance guidance weight. For each context $c_i$, let

$$v_{\mathrm{top}}(c_i)=\frac{1}{k}\sum_{j\in\mathrm{Top}\text{-}k}R(\widehat{m}_{i,j};c_i)$$

be the mean reward of the top-$k$ sampled candidates from the current rollout group. Setting $k=1$ recovers the best sampled candidate; larger $k$ yields a smoother top-$k$ average. We define the context-dependent guidance weight

$$\beta_{\mathrm{guide}}(c_i)=\beta_{\min}+(\beta_{\max}-\beta_{\min})\,\sigma\!\left(-\alpha\bigl(v_{\mathrm{top}}(c_i)-v_{\mathrm{ref}}(z_i)\bigr)\right),$$

where $\sigma(\cdot)$ is the logistic sigmoid and $\alpha>0$ controls sharpness of the transition. The semantics are:

- Policy lags the reference ($v_{\mathrm{top}}<v_{\mathrm{ref}}$): $\beta_{\mathrm{guide}}$ is large, and the update is imitation-dominant.
- Policy surpasses the reference ($v_{\mathrm{top}}>v_{\mathrm{ref}}$): $\beta_{\mathrm{guide}}$ shrinks, and the update shifts toward reinforcement learning.
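
The guidance weight above can be sketched in a few lines of plain Python. This is an illustrative sketch only: the function names and the default values of `beta_min`, `beta_max`, and `alpha` are hypothetical choices made here, not hyperparameters reported in the text.

```python
import math


def top_k_mean(rewards, k=1):
    """Mean reward of the k best candidates in the current rollout group."""
    top = sorted(rewards, reverse=True)[:k]
    return sum(top) / len(top)


def guidance_weight(v_top, v_ref, beta_min=0.0, beta_max=1.0, alpha=10.0):
    """beta_min + (beta_max - beta_min) * sigmoid(-alpha * (v_top - v_ref)).

    The default bounds and sharpness are placeholders for illustration.
    """
    x = -alpha * (v_top - v_ref)
    # Numerically stable logistic sigmoid for large positive or negative x.
    if x >= 0:
        s = 1.0 / (1.0 + math.exp(-x))
    else:
        e = math.exp(x)
        s = e / (1.0 + e)
    return beta_min + (beta_max - beta_min) * s


rollout_rewards = [0.10, 0.35, 0.00, 0.20]
v_top = top_k_mean(rollout_rewards, k=2)

# Policy lags a strong reference: the weight moves toward beta_max.
print(guidance_weight(v_top, v_ref=0.80))

# Policy has surpassed a weak reference: the weight moves toward beta_min.
print(guidance_weight(v_top, v_ref=0.05))
```

At a tie, $v_{\mathrm{top}}=v_{\mathrm{ref}}$, the sigmoid equals one half, so the weight sits exactly midway between the two bounds; $\alpha$ then sets how quickly it saturates on either side.
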

The bounds $\beta_{\min}$ and $\beta_{\max}$ control, respectively, the residual imitation pressure once the policy has surpassed the reference and the maximum imitation weight when it lags. The choice of $\beta_{\min}$ in particular determines the late-training regime: $\beta_{\min}=0$ recovers pure RL in the limit, while $\beta_{\min}>0$ retains a residual self-distillation signal toward the current best target (Appendix [A.6](https://arxiv.org/html/2607.00531#A1.SS6)).

##### Active referencing.

Active referencing replaces the static dataset reference with the best policy-discovered candidate available so far. For each context $c_i$, Active-GRPO maintains a capacity-limited memory bank

$$\mathcal{B}(c_i)=\{(m_{\ell},r_{\ell})\}_{\ell=1}^{|\mathcal{B}(c_i)|},\qquad|\mathcal{B}(c_i)|\leq K,$$

initialized with the dataset-provided reference molecule,

$$\mathcal{B}(c_i)\leftarrow\{(m_{\mathrm{ref},i},v_{\mathrm{ref}}(z_i))\},$$

and subsequently augmented with policy-generated molecules that are promoted during training. Thus, each bank entry consists of a candidate guidance molecule $m_{\ell}$ and its reward $r_{\ell}=R(m_{\ell};c_i)$ under the same conditioning context. The bank is keyed by a deterministic per-example identifier and persists across training steps and epochs, so each example accumulates a best-so-far set over the course of training. We store up to $K$ candidates rather than only the current best one so that the bank represents a small per-instance candidate set. Although our main implementation selects the maximum-reward candidate as the active target, the same bank can support more robust reference statistics, such as top-$k$ averaging or diversity-aware target selection.

Promotion and eviction. A generated molecule $\widehat{m}$ is promoted into the bank when it improves on the current reference reward by a margin $\delta$ and satisfies a task-specific admissibility predicate:

$$\mathrm{Promote}(\widehat{m}\mid c_i)=\mathbf{1}\!\Big[R(\widehat{m};c_i)>v_{\mathrm{ref}}(z_i)+\delta\;\wedge\;Q(\widehat{m};c_i)\;\wedge\;\mathrm{Valid}(\widehat{m})\Big].$$

The margin $\delta$ guards against promoting near-tie noise; the predicate $Q(\widehat{m};c_i)$ enforces task-specific hard constraints (in our experiments, a minimum Tanimoto similarity to the input molecule) that are kept separate from the scalar reward to prevent structurally invalid candidates from entering the bank merely because they score high on one reward component. When the bank is full, the lowest-reward entry is evicted. Promoted molecules are canonicalized before insertion so that supervision targets a deterministic SMILES surface form (Appendices [A.3](https://arxiv.org/html/2607.00531#A1.SS3) and [A.4](https://arxiv.org/html/2607.00531#A1.SS4)).

Active guidance target. The target supplied to $\mathcal{L}_{\mathrm{guide}}$ is the highest-reward entry currently in the bank,

$$m^{*}(c_i)=\arg\max_{(m,r)\in\mathcal{B}(c_i)}r.$$

Thus, once the policy discovers a molecule better than the original dataset reference, subsequent guidance distills from the best available candidate rather than from the static reference.
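
The bank, promotion rule, eviction, and target selection described above can be sketched as a small per-instance data structure. This is a hedged illustration, not the authors' code: the class and method names, the default `capacity` and `margin`, and the duplicate check are assumptions made here. The validity and admissibility checks are passed in as booleans, SMILES strings are assumed to be canonicalized by the caller, and the promotion threshold follows the formula as written, comparing against the dataset reference reward $v_{\mathrm{ref}}(z_i)$.

```python
from dataclasses import dataclass, field


@dataclass
class ReferenceBank:
    """Capacity-limited bank of (molecule, reward) pairs for one context."""
    ref_molecule: str
    ref_reward: float          # v_ref(z_i)
    capacity: int = 4          # K (placeholder value)
    margin: float = 0.0        # delta (placeholder value)
    entries: list = field(init=False)

    def __post_init__(self):
        # The bank starts with the dataset-provided reference only.
        self.entries = [(self.ref_molecule, self.ref_reward)]

    def maybe_promote(self, molecule, reward, is_valid, is_admissible):
        """Apply the Promote rule; return True if the molecule was inserted."""
        if not (is_valid and is_admissible):
            return False
        if reward <= self.ref_reward + self.margin:
            return False
        # Assumption: skip exact duplicates so they do not fill the bank.
        if any(m == molecule for m, _ in self.entries):
            return False
        self.entries.append((molecule, reward))
        if len(self.entries) > self.capacity:
            # Evict the lowest-reward entry once capacity is exceeded.
            self.entries.remove(min(self.entries, key=lambda e: e[1]))
        return True

    def target(self):
        """Return the highest-reward molecule, used as the guidance target."""
        return max(self.entries, key=lambda e: e[1])[0]


bank = ReferenceBank("CCO", ref_reward=0.30, capacity=2, margin=0.05)
bank.maybe_promote("CCN", 0.32, is_valid=True, is_admissible=True)   # within margin
bank.maybe_promote("CCC", 0.50, is_valid=True, is_admissible=True)   # promoted
bank.maybe_promote("CCCl", 0.90, is_valid=True, is_admissible=False) # fails Q
print(bank.target())
```

Keeping the admissibility flag separate from the reward mirrors the design choice in the text: a candidate with a high scalar reward still cannot enter the bank if it violates the hard similarity constraint.
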

Optimization objective. For a minibatch of size $B$, Active-GRPO optimizes

$$\mathcal{L}_{\mathrm{Active\text{-}GRPO}}=\mathcal{L}_{\mathrm{RL}}+\frac{1}{B}\sum_{i=1}^{B}\beta_{\mathrm{guide}}(c_i)\,\mathcal{L}_{\mathrm{guide}}^{(i)}\!\bigl(m^{*}(c_i)\bigr).$$

Crucially, the guidance loss is weighted *per instance* rather than by a batch-averaged coefficient. This preserves the intended adaptive behavior: different examples within the same minibatch may simultaneously operate in different regimes of imitation versus self-improvement, depending on how their current top-$k$ rollouts compare to their respective references. Together, the two mechanisms make reference guidance robust by construction across the spectrum of reference quality. Relative to RePO \[[32](https://arxiv.org/html/2607.00531#bib.bib32)\], Active-GRPO introduces two essential changes: a dynamic, context-dependent guidance coefficient that decides when and how strongly to imitate, and an active referencing mechanism that replaces static reference guidance with the best available target discovered during training. Algorithm [1](https://arxiv.org/html/2607.00531#alg1) in Appendix [A.8](https://arxiv.org/html/2607.00531#A1.SS8) summarizes the complete training procedure; additional implementation, synchronization, reproducibility, and hyperparameter details are provided in Appendix [A](https://arxiv.org/html/2607.00531#A1).

## 4 Experiments

We evaluate Active-GRPO on instruction-conditioned molecular optimization benchmarks and compare it against RePO \[[32](https://arxiv.org/html/2607.00531#bib.bib32)\] and related baselines. Our experiments address four questions:

- Q1. Does Active-GRPO improve molecular optimization performance over fixed-reference guidance? (Section [4.2](https://arxiv.org/html/2607.00531#S4.SS2))
- Q2.
Are the gains explained by simpler alternatives such as iterative self-distillation or stronger fixed references? (Section [4.2](https://arxiv.org/html/2607.00531#S4.SS2); ablations in Section [4.3](https://arxiv.org/html/2607.00531#S4.SS3))
- Q3. Does Active-GRPO’s advantage grow with the optimization headroom of each instance, as the static-reference ceiling argument predicts? (Section [4.4](https://arxiv.org/html/2607.00531#S4.SS4))
- Q4. Do training-time dynamics match the intended adaptive mechanism? (Section [4.5](https://arxiv.org/html/2607.00531#S4.SS5))

### 4.1 Experimental Setup

##### Benchmarks and metrics.

We evaluate on TOMG-Bench \[[29](https://arxiv.org/html/2607.00531#bib.bib29)\], focusing on the MolOpt subtasks LogP, MR, and QED. These tasks provide a controlled setting for reference-guided, instruction-conditioned molecular optimization. Following RePO \[[32](https://arxiv.org/html/2607.00531#bib.bib32)\], we report Success Rate (SR), Tanimoto Similarity (Sim) with Morgan fingerprints, and the composite metric SR×Sim, which summarizes the trade-off between task success and structural preservation. Additional evaluations on TOMG-Bench MolEdit, hard-example splits, and longer-horizon settings are reported in Appendix [C](https://arxiv.org/html/2607.00531#A3).

##### Baselines and variants.

We compare against zero-shot inference, GRPO-only training without answer-level guidance, RePO as the fixed-reference baseline, and two alternatives that test simpler explanations: Iterative SFT, which captures self-distillation without active policy improvement, and Offline-strengthened RePO, which replaces the dataset reference with a stronger fixed target before training. We also include two ablations of Active-GRPO: $\beta$-only, which uses active imitate-reinforce without active referencing, and bank-only, which uses active referencing with a fixed guidance weight.

##### Training and reporting protocol.
All trained methods share the same backbone, reward family, rollout budget, decoding rule, and evaluation pipeline under a matched 40GB-A100 configuration, which may differ from prior RePO reports \[[32](https://arxiv.org/html/2607.00531#bib.bib32)\]; we therefore interpret results as matched relative comparisons. We report mean ± standard error over three seeds (zero-shot is evaluated once), with significance assessed via per-example paired bootstrap (10,000 resamples). Implementation details, hyperparameters, and timing are in Appendix [A](https://arxiv.org/html/2607.00531#A1) and [C.4](https://arxiv.org/html/2607.00531#A3.SS4).

### 4.2 Main Results on TOMG-Bench MolOpt

Table 1: Main results on TOMG-Bench MolOpt. We report SR×Sim, the standard composite metric balancing task success and structural preservation. All trained methods are reported as mean ± standard error over three seeds; zero-shot is evaluated once. Active-GRPO achieves the best performance on all three subtasks and the highest average score. All trained methods are run under the same matched 40GB-A100 configuration.

| Method | LogP SR×Sim ↑ | MR SR×Sim ↑ | QED SR×Sim ↑ | Avg SR×Sim ↑ |
| --- | --- | --- | --- | --- |
| Zero-shot | 0.1700 | 0.1314 | 0.1114 | 0.1376 |
| GRPO-only | 0.1222 ± 0.0112 | 0.0956 ± 0.0132 | 0.0699 ± 0.0061 | 0.0959 |
| Iterative SFT | 0.1888 ± 0.0266 | 0.1736 ± 0.0200 | 0.1257 ± 0.0116 | 0.1627 |
| Offline-strengthened RePO | 0.1974 ± 0.0076 | 0.1853 ± 0.0045 | 0.1130 ± 0.0108 | 0.1652 |
| RePO | 0.1877 ± 0.0076 | 0.1860 ± 0.0089 | 0.1258 ± 0.0033 | 0.1665 |
| Active-GRPO (ours) | 0.1977 ± 0.0104 | 0.1904 ± 0.0100 | 0.1440 ± 0.0081 | 0.1773 |

Table [1](https://arxiv.org/html/2607.00531#S4.T1) reports the main results on TOMG-Bench MolOpt. Active-GRPO obtains the highest SR×Sim on all three subtasks and the best overall average.
Relative to RePO, the gains are +0.0148 on LogP, +0.0090 on MR, and +0.0206 on QED, all significant under paired bootstrap testing ($p<0.001$; Appendix [C.2](https://arxiv.org/html/2607.00531#A3.SS2)). The comparison rules out two simpler explanations. Iterative SFT improves over GRPO-only but stays below both RePO and Active-GRPO, showing that self-distillation alone is insufficient. Offline-strengthened RePO is the strongest non-adaptive alternative, yet still falls below Active-GRPO, indicating the gain is not explained by replacing the reference with a stronger fixed target. The advantage comes from making guidance adaptive and per-instance, not merely stronger.

##### Success–similarity trade-off.

Active-GRPO’s improvement comes primarily from higher success rates. Averaged across the three subtasks, SR rises from 0.2017 (RePO) to 0.2249, while average similarity decreases from 0.8281 to 0.7906. This is consistent with the design: active imitate-reinforce reduces unnecessary imitation pressure, and active referencing lets the policy move beyond the local neighborhood of the fixed reference when doing so improves reward. As we show in Section [4.6](https://arxiv.org/html/2607.00531#S4.SS6), much of RePO’s higher similarity reflects no-op behavior (copying the source molecule) rather than successful structural preservation under a real edit. Full per-metric results are in Appendix [C.1](https://arxiv.org/html/2607.00531#A3.SS1).

### 4.3 Ablation: Active Imitate-Reinforce and Active Referencing Are Complementary

Table [2](https://arxiv.org/html/2607.00531#S4.T2) isolates the two core mechanisms in Active-GRPO. On average SR×Sim, neither component alone improves over RePO: the active-imitate-reinforce-only variant underperforms RePO, while the active-referencing-only variant is approximately tied.
The full method, however, improves over all variants by a clear margin, indicating that active imitate-reinforce and active referencing are complementary rather than redundant.

Table 2: Ablation results on TOMG-Bench MolOpt. Neither active imitate-reinforce nor active referencing alone explains the gain. The full method achieves the best score, indicating that the two mechanisms play complementary roles and are most effective when coupled.

| Variant | LogP SR×Sim ↑ | MR SR×Sim ↑ | QED SR×Sim ↑ | Avg SR×Sim ↑ | Δ vs Full |
| --- | --- | --- | --- | --- | --- |
| RePO | 0.1877 ± 0.0076 | 0.1860 ± 0.0089 | 0.1258 ± 0.0033 | 0.1665 | −0.0108 |
| Active-GRPO (active-imitate-reinforce only) | 0.1861 ± 0.0099 | 0.1791 ± 0.0101 | 0.1220 ± 0.0098 | 0.1624 | −0.0149 |
| Active-GRPO (active-referencing only) | 0.1920 ± 0.0078 | 0.1781 ± 0.0046 | 0.1275 ± 0.0072 | 0.1659 | −0.0114 |
| Active-GRPO (full) | 0.1977 ± 0.0104 | 0.1904 ± 0.0100 | 0.1440 ± 0.0081 | 0.1773 | – |

This pattern supports the design. Active imitate-reinforce controls *when* and how strongly the model should imitate, while active referencing controls *what* target to imitate. Using only one mechanism leaves coordination incomplete: active-imitate-reinforce only can reduce imitation pressure but still imitates a fixed target, whereas active-referencing only can update the target but cannot adapt guidance strength. The full method combines both, allowing the policy to reduce stale-reference pressure while distilling from stronger policy-discovered targets.

### 4.4 Optimization-Headroom Conditional Analysis

The static-reference ceiling argument predicts a specific empirical pattern: Active-GRPO should help most on instances where the static source-reference offers the weakest guidance toward the requested edit, that is, where there is more optimization headroom beyond the reference. We test this directly.
For each test example, we compute an optimization-headroom score from the source molecule’s original property value: larger original LogP/MR for the decrease tasks, and $1-\text{QED}$ for increase-QED. We partition test examples into quintiles and compare Active-GRPO against RePO within each bin. (This measures optimization headroom rather than an independent reference-quality gap; here the source molecule also serves as the static anchor.) The results support the central mechanism of Active-GRPO. The trend is clearest for LogP, where the gain of Active-GRPO over RePO in SR×Sim grows monotonically from +0.002 in Q1 to +0.031 in Q5. MR and QED show positive but noisier patterns, with QED gaining in both low- and high-headroom regimes. Active-GRPO is most useful precisely where the static reference is least informative: as headroom grows, active imitate-reinforce reduces unnecessary imitation pressure, and active referencing supplies stronger policy-discovered targets. Full per-subtask tables are in Appendix [C.3](https://arxiv.org/html/2607.00531#A3.SS3).

LogP headroom breakdown

| Bin | Range | ΔSR | ΔSR×Sim |
| --- | --- | --- | --- |
| Q1 | (−3.91, 1.30] | +0.007 | +0.002 |
| Q2 | (1.30, 2.25] | +0.014 | +0.007 |
| Q3 | (2.25, 2.94] | +0.024 | +0.013 |
| Q4 | (2.94, 3.63] | +0.032 | +0.022 |
| Q5 | (3.63, 7.36] | +0.045 | +0.031 |

Figure 3: Optimization-headroom conditional analysis. Left: Active-GRPO–RePO gain in SR×Sim across headroom quintiles for all three MolOpt subtasks. Right: the LogP numerical breakdown, where the gain increases monotonically from the lowest-headroom bin to the highest-headroom bin. Full per-subtask numerical tables are reported in Appendix [C.3](https://arxiv.org/html/2607.00531#A3.SS3).

### 4.5 Training Dynamics Match the Intended Mechanism

We verify that Active-GRPO behaves as designed by inspecting training-time statistics.
Figure [4](https://arxiv.org/html/2607.00531#S4.F4) reports raw training reward, training loss, average guidance weight, and memory-bank usage. Three patterns are consistent with the intended adaptive mechanism:

- Guidance shifts from imitation toward RL. The average guidance weight $\beta_{\mathrm{guide}}$ starts around 1.1, above the midpoint of 0.9 implied by $\beta_{\min}=0.3$ and $\beta_{\max}=1.5$ (Table [3](https://arxiv.org/html/2607.00531#A1.T3)), which indicates early-training imitation dominance. It then decreases to about 0.93, indicating the policy has caught up to or surpassed references on a substantial fraction of examples.
- The memory bank fills steadily. Active-GRPO accumulates 160–190 promoted entries by the end of training, confirming the policy regularly produces molecules superior to their dataset references.
- Self-distillation activates meaningfully. Roughly 17–22% of examples receive guidance from a policy-promoted target rather than the dataset reference, confirming that active referencing is a substantive contributor to training.

Together, these dynamics show that Active-GRPO is not merely reweighting the RePO loss: it actively changes both the strength and the target of guidance during training, as predicted by the active reasoning paradigm.

Figure 4: Active-GRPO training dynamics over three seeds, with mean ± standard deviation. Panel (a) shows the raw training reward signal logged by the trainer, rather than evaluation SR×Sim; panels (b)–(d) show training loss, average guidance weight, and memory-bank/self-distillation activity.

### 4.6 Qualitative Case Studies

We inspect representative examples in which Active-GRPO succeeds and RePO fails. Across LogP, MR, and QED, two characteristic failure modes of fixed-reference guidance recur: RePO often (i) returns the source molecule essentially unchanged or (ii) makes a structural modification that moves the property in the worse direction.
Active-GRPO instead tends to make targeted edits that satisfy the property objective while preserving most of the input structure. For example, Active-GRPO expands a ring to increase LogP, shortens a ring system to decrease MR, and introduces a small heteroatom change to decrease QED. These patterns help explain Active-GRPO’s success–similarity trade-off: much of RePO’s higher average similarity comes from no-op outputs that satisfy the similarity term because no edit was made. We quantify this directly via a no-op failure rate, defined as the fraction of outputs with $\text{Sim}(\widehat{m},m_{\mathrm{src}})\geq 0.98$ that nevertheless fail the property objective. Across all three subtasks, Active-GRPO reduces no-op failures relative to RePO (Appendix [C.5](https://arxiv.org/html/2607.00531#A3.SS5), Figure [6](https://arxiv.org/html/2607.00531#A3.F6)), confirming that its slightly lower similarity reflects more frequent successful editing rather than excessive structural deviation.

### 4.7 Additional Evaluations

We report additional evaluations in Appendix [C](https://arxiv.org/html/2607.00531#A3), including qualitative case studies and visible reasoning-trace examples (Appendix [C.5](https://arxiv.org/html/2607.00531#A3.SS5)), hyperparameter sensitivity (Appendix [C.6](https://arxiv.org/html/2607.00531#A3.SS6)), hard-example stress tests (Appendix [C.7](https://arxiv.org/html/2607.00531#A3.SS7)), matched MolEdit structural optimization (Appendix [C.9](https://arxiv.org/html/2607.00531#A3.SS9)), and longer-horizon single-seed evaluation (Appendix [C.8](https://arxiv.org/html/2607.00531#A3.SS8)). These results support the main finding while clarifying the method’s scope. On a ZINC-derived hard subset, Active-GRPO maintains a small but consistent advantage over RePO.
On MolEdit, Active-GRPO improves average SR×Sim over RePO and performs best on AddComponent and SubComponent, suggesting that the adaptive mechanisms transfer beyond property-only optimization. The longer-horizon evaluation further shows competitive average performance under shared-policy training. Broader discovery-leaning and strongly multi-objective settings remain important directions for future work.

## 5 Conclusion

We introduced Active-GRPO, an active extension of reference-guided policy optimization that adapts both *when* to imitate through a context-dependent guidance weight and *what* to imitate through active referencing over policy-discovered candidates. On TOMG-Bench MolOpt, Active-GRPO achieves the best average SR×Sim and improves over RePO on all three subtasks under matched multi-seed evaluation, with ablations showing that dynamic guidance and active referencing play complementary roles. Our results suggest that fixed-reference guidance is most limiting when the source molecule leaves substantial optimization headroom, while adaptive guidance better supports continued policy improvement. A current limitation is that our evaluation remains centered on reference-guided molecular optimization; extending the framework to broader discovery settings with weak or absent references may require stronger candidate proposal and exploration mechanisms. One promising direction is to combine Active-GRPO with active sampling or curriculum construction, focusing training compute on prompts where the policy, reference, and reward signal disagree most.

#### Acknowledgements

We thank Xuan Li and Bo Han for helpful discussions. This work is supported by the Donald and Delia Baxter Foundation Faculty Scholar award, the Weintz family foundation, and the AI4Biomedicine fund.
## References - AI4Science and Quantum \[2023\]Microsoft Research AI4Science and Microsoft Azure Quantum\.The impact of large language models on scientific discovery: a preliminary study using gpt\-4\.*arXiv preprint arXiv:2311\.07361*, 2023\. - Andrychowicz et al\. \[2017\]Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, OpenAI Pieter Abbeel, and Wojciech Zaremba\.Hindsight experience replay\.*Advances in neural information processing systems*, 30, 2017\. - Anthony et al\. \[2017\]Thomas Anthony, Zheng Tian, and David Barber\.Thinking fast and slow with deep learning and tree search\.*Advances in neural information processing systems*, 30, 2017\. - Bajusz et al\. \[2015\]Dávid Bajusz, Anita Rácz, and Károly Héberger\.Why is tanimoto index an appropriate choice for fingerprint\-based similarity calculations?*Journal of cheminformatics*, 7\(1\):20, 2015\. - Belkhale et al\. \[2023\]Suneel Belkhale, Yuchen Cui, and Dorsa Sadigh\.Data quality in imitation learning\.*Advances in neural information processing systems*, 36:80375–80395, 2023\. - Bemis and Murcko \[1996\]Guy W Bemis and Mark A Murcko\.The properties of known drugs\. 1\. molecular frameworks\.*Journal of medicinal chemistry*, 39\(15\):2887–2893, 1996\. - Boiko et al\. \[2023\]Daniil A Boiko, Robert MacKnight, Ben Kline, and Gabe Gomes\.Autonomous chemical research with large language models\.*Nature*, 624\(7992\):570–578, 2023\. - Brown et al\. \[2019\]Nathan Brown, Marco Fiscato, Marwin HS Segler, and Alain C Vaucher\.Guacamol: benchmarking models for de novo molecular design\.*Journal of chemical information and modeling*, 59\(3\):1096–1108, 2019\. - Chen et al\. \[2021\]Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al\.Evaluating large language models trained on code\.*arXiv preprint arXiv:2107\.03374*, 2021\. - Chu et al\. 
\[2025\]Tianzhe Chu, Yuexiang Zhai, Jihan Yang, Shengbang Tong, Saining Xie, Dale Schuurmans, Quoc V Le, Sergey Levine, and Yi Ma\.Sft memorizes, rl generalizes: A comparative study of foundation model post\-training\.*arXiv preprint arXiv:2501\.17161*, 2025\. - Cobbe et al\. \[2021\]Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al\.Training verifiers to solve math word problems\.*arXiv preprint arXiv:2110\.14168*, 2021\. - Djoumbou\-Feunang et al\. \[2023\]Yannick Djoumbou\-Feunang, Jeremy Wilmot, John Kinney, Pritam Chanda, Pulan Yu, Avery Sader, Max Sharifi, Scott Smith, Junjun Ou, Jie Hu, Elizabeth Shipp, Dirk Tomandl, and Siva P\. Kumpatla\.Cheminformatics and artificial intelligence for accelerating agrochemical discovery\.*Frontiers in Chemistry*, 11:1292027, 2023\.doi:10\.3389/fchem\.2023\.1292027\. - Dong et al\. \[2023\]Hanze Dong, Wei Xiong, Deepanshu Goyal, Yihan Zhang, Winnie Chow, Rui Pan, Shizhe Diao, Jipeng Zhang, Kashun Shum, and Tong Zhang\.Raft: Reward ranked finetuning for generative foundation model alignment\.*arXiv preprint arXiv:2304\.06767*, 2023\. - Edwards et al\. \[2022\]Carl Edwards, Tuan Lai, Kevin Ros, Garrett Honke, Kyunghyun Cho, and Heng Ji\.Translation between molecules and natural language\.In*Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing*, pages 375–413, 2022\. - Gao et al\. \[2022\]Wenhao Gao, Tianfan Fu, Jimeng Sun, and Connor Coley\.Sample efficiency matters: a benchmark for practical molecular optimization\.*Advances in neural information processing systems*, 35:21342–21357, 2022\. - Gao et al\. \[2024\]Yang Gao, Dana Alon, and Donald Metzler\.Impact of preference noise on the alignment performance of generative language models\.*arXiv preprint arXiv:2404\.09824*, 2024\. - Grattafiori et al\. 
\[2024\]Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al\-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al\.The llama 3 herd of models\.*arXiv preprint arXiv:2407\.21783*, 2024\. - Gulcehre et al\. \[2023\]Caglar Gulcehre, Tom Le Paine, Srivatsan Srinivasan, Ksenia Konyushkova, Lotte Weerts, Abhishek Sharma, Aditya Siddhant, Alex Ahern, Miaosen Wang, Chenjie Gu, et al\.Reinforced self\-training \(rest\) for language modeling\.*arXiv preprint arXiv:2308\.08998*, 2023\. - Guo et al\. \[2025\]Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al\.Deepseek\-r1: Incentivizing reasoning capability in llms via reinforcement learning\.*arXiv preprint arXiv:2501\.12948*, 2025\. - Hendrycks et al\. \[2021\]Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt\.Measuring mathematical problem solving with the math dataset\.*arXiv preprint arXiv:2103\.03874*, 2021\. - Hester et al\. \[2018\]Todd Hester, Matej Vecerik, Olivier Pietquin, Marc Lanctot, Tom Schaul, Bilal Piot, Dan Horgan, John Quan, Andrew Sendonaris, Ian Osband, et al\.Deep Q\-learning from demonstrations\.In*Proceedings of the National Conference on Artificial Intelligence \(AAAI\)*, 2018\. - Huang et al\. \[2021\]Kexin Huang, Tianfan Fu, Wenhao Gao, Yue Zhao, Yusuf Roohani, Jure Leskovec, Connor W Coley, Cao Xiao, Jimeng Sun, and Marinka Zitnik\.Therapeutics data commons: Machine learning datasets and tasks for drug discovery and development\.*arXiv preprint arXiv:2102\.09548*, 2021\. - Jablonka et al\. \[2024\]Kevin Maik Jablonka, Philippe Schwaller, Andres Ortega\-Guerrero, and Berend Smit\.Leveraging large language models for predictive chemistry\.*Nature Machine Intelligence*, 6:161–169, 2024\.doi:10\.1038/s42256\-023\-00788\-1\. - Jimenez et al\. 
\[2023\]Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan\.Swe\-bench: Can language models resolve real\-world github issues?*arXiv preprint arXiv:2310\.06770*, 2023\. - Jin et al\. \[2018\]Wengong Jin, Kevin Yang, Regina Barzilay, and Tommi Jaakkola\.Learning multimodal graph\-to\-graph translation for molecular optimization\.*arXiv preprint arXiv:1812\.01070*, 2018\. - Jin et al\. \[2020\]Wengong Jin, Regina Barzilay, and Tommi Jaakkola\.Hierarchical generation of molecular graphs using structural motifs\.In*International conference on machine learning*, pages 4839–4848\. PMLR, 2020\. - Kojima et al\. \[2022\]Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa\.Large language models are zero\-shot reasoners\.*Advances in neural information processing systems*, 35:22199–22213, 2022\. - Lambert et al\. \[2024\]Nathan Lambert, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman, Lester James V Miranda, Alisa Liu, Nouha Dziri, Shane Lyu, et al\.Tulu 3: Pushing frontiers in open language model post\-training\.*arXiv preprint arXiv:2411\.15124*, 2024\. - Li et al\. \[2024a\]Jiatong Li, Junxian Li, Yunqing Liu, Dongzhan Zhou, and Qing Li\.Tomg\-bench: Evaluating llms on text\-based open molecule generation\.*arXiv preprint arXiv:2412\.14642*, 2024a\. - Li et al\. \[2024b\]Jiatong Li, Yunqing Liu, Wenqi Fan, Xiao\-Yong Wei, Hui Liu, Jiliang Tang, and Qing Li\.Empowering molecule discovery for molecule\-caption translation with large language models: A chatgpt perspective\.*IEEE transactions on knowledge and data engineering*, 36\(11\):6071–6083, 2024b\. - Li et al\. \[2026a\]Xuan Li, Zhanke Zhou, Zongze Li, Jiangchao Yao, Yu Rong, Lu Zhang, and Bo Han\.Reference\-guided policy optimization for molecular optimization via llm reasoning\.*arXiv preprint arXiv:2603\.05900*, 2026a\. - Li et al\. 
\[2026b\]Xuan Li, Zhanke Zhou, Zongze Li, Jiangchao Yao, Yu Rong, Lu Zhang, and Bo Han\.Repo: Reference\-guided policy optimization for molecular optimization via llm reasoning\.In*International Conference on Learning Representations \(ICLR\)*, 2026b\. - Lin et al\. \[2023\]Bill Yuchen Lin, Abhilasha Ravichander, Ximing Lu, Nouha Dziri, Melanie Sclar, Khyathi Chandu, Chandra Bhagavatula, and Yejin Choi\.The unlocking spell on base llms: Rethinking alignment via in\-context learning\.*arXiv preprint arXiv:2312\.01552*, 2023\. - Lipinski and Hopkins \[2004\]Christopher Lipinski and Andrew Hopkins\.Navigating chemical space for biology and medicine\.*Nature*, 432\(7019\):855–861, 2004\. - Liu et al\. \[2023\]Xuefeng Liu, Takuma Yoneda, Rick L Stevens, Matthew R Walter, and Yuxin Chen\.Blending imitation and reinforcement learning for robust policy improvement\.*arXiv preprint arXiv:2310\.01737*, 2023\. - Muennighoff et al\. \[2025\]Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei\-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, and Tatsunori B Hashimoto\.s1: Simple test\-time scaling\.In*Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing*, pages 20286–20332, 2025\. - Nair et al\. \[2018\]Ashvin Nair, Bob McGrew, Marcin Andrychowicz, Wojciech Zaremba, and Pieter Abbeel\.Overcoming exploration in reinforcement learning with demonstrations\.In*Proceedings of the IEEE International Conference on Robotics and Automation \(ICRA\)*, pages 6292–6299, 2018\. - OpenAI \[2023\]R OpenAI\.Gpt\-4 technical report\. arxiv 2303\.08774\.*View in Article*, 2\(5\):1, 2023\. - Ouyang et al\. \[2022\]Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al\.Training language models to follow instructions with human feedback\.*Advances in neural information processing systems*, 35:27730–27744, 2022\. - Peng et al\. 
\[2019\]Xue Bin Peng, Aviral Kumar, Grace Zhang, and Sergey Levine\.Advantage\-weighted regression: Simple and scalable off\-policy reinforcement learning\.*arXiv preprint arXiv:1910\.00177*, 2019\. - Polykovskiy et al\. \[2020\]Daniil Polykovskiy, Alexander Zhebrak, Benjamin Sanchez\-Lengeling, Sergey Golovanov, Oktai Tatanov, Stanislav Belyaev, Rauf Kurbanov, Aleksey Artamonov, Vladimir Aladinskiy, Mark Veselov, et al\.Molecular sets \(moses\): a benchmarking platform for molecular generation models\.*Frontiers in pharmacology*, 11:565644, 2020\. - Rajeswaran et al\. \[2017\]Aravind Rajeswaran, Vikash Kumar, Abhishek Gupta, Giulia Vezzani, John Schulman, Emanuel Todorov, and Sergey Levine\.Learning complex dexterous manipulation with deep reinforcement learning and demonstrations\.*arXiv preprint arXiv:1709\.10087*, 2017\. - Riedmiller et al\. \[2018\]Martin Riedmiller, Roland Hafner, Thomas Lampe, Michael Neunert, Jonas Degrave, Tom Wiele, Vlad Mnih, Nicolas Heess, and Jost Tobias Springenberg\.Learning by playing solving sparse reward tasks from scratch\.In*International conference on machine learning*, pages 4344–4353\. PMLR, 2018\. - Ross et al\. \[2011\]Stéphane Ross, Geoffrey Gordon, and Drew Bagnell\.A reduction of imitation learning and structured prediction to no\-regret online learning\.In*Proceedings of the fourteenth international conference on artificial intelligence and statistics*, pages 627–635, 2011\. - Sanchez\-Lengeling and Aspuru\-Guzik \[2018\]Benjamin Sanchez\-Lengeling and Alán Aspuru\-Guzik\.Inverse molecular design using machine learning: Generative models for matter engineering\.*Science*, 361\(6400\):360–365, 2018\. - Schulman et al\. \[2017\]John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov\.Proximal policy optimization algorithms\.*arXiv preprint arXiv:1707\.06347*, 2017\. - Shao et al\. 
\[2024\]Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al\.Deepseekmath: Pushing the limits of mathematical reasoning in open language models\.*arXiv preprint arXiv:2402\.03300*, 2024\. - Singh et al\. \[2023\]Avi Singh, John D Co\-Reyes, Rishabh Agarwal, Ankesh Anand, Piyush Patil, Xavier Garcia, Peter J Liu, James Harrison, Jaehoon Lee, Kelvin Xu, et al\.Beyond human data: Scaling self\-training for problem\-solving with language models\.*arXiv preprint arXiv:2312\.06585*, 2023\. - Stokes et al\. \[2020\]Jonathan M Stokes, Kevin Yang, Kyle Swanson, Wengong Jin, Andres Cubillos\-Ruiz, Nina M Donghia, Craig R MacNair, Shawn French, Lindsey A Carfrae, Zohar Bloom\-Ackermann, et al\.A deep learning approach to antibiotic discovery\.*Cell*, 180\(4\):688–702, 2020\. - Taori et al\. \[2023\]Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B Hashimoto\.Stanford alpaca: An instruction\-following llama model, 2023\. - Taylor et al\. \[2022\]Ross Taylor, Marcin Kardas, Guillem Cucurull, Thomas Scialom, Anthony Hartshorn, Elvis Saravia, Andrew Poulton, Viktor Kerkez, and Robert Stojnic\.Galactica: A large language model for science\.*arXiv preprint arXiv:2211\.09085*, 2022\. - Wei et al\. \[2022\]Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al\.Chain\-of\-thought prompting elicits reasoning in large language models\.*Advances in neural information processing systems*, 35:24824–24837, 2022\. - Weininger \[1988\]David Weininger\.Smiles, a chemical language and information system\. 1\. introduction to methodology and encoding rules\.*Journal of chemical information and computer sciences*, 28\(1\):31–36, 1988\. - Yu et al\. 
[2023] Longhui Yu, Weisen Jiang, Han Shi, Jincheng Yu, Zhengying Liu, Yu Zhang, James T Kwok, Zhenguo Li, Adrian Weller, and Weiyang Liu. Metamath: Bootstrap your own mathematical questions for large language models. *arXiv preprint arXiv:2309.12284*, 2023.
- Yuan et al. [2024] Weizhe Yuan, Richard Yuanzhe Pang, Kyunghyun Cho, Xian Li, Sainbayar Sukhbaatar, Jing Xu, and Jason Weston. Self-rewarding language models. *arXiv preprint arXiv:2401.10020*, 2024.
- Zelikman et al. [2022] Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noah Goodman. Star: Bootstrapping reasoning with reasoning. *Advances in Neural Information Processing Systems*, 35:15476–15488, 2022.

## Appendix A Implementation and Reproducibility Details

This appendix provides implementation details omitted from the main algorithmic presentation, including the precise RL objective, rollout synchronization, memory bank semantics, canonicalization of guidance targets, and hyperparameter settings.

### A.1 RL Objective and KL Regularization

The RL term in Active-GRPO follows the same GRPO-style objective used in our RePO baseline. For a minibatch of prompt groups, we compute group-relative advantages

$$\widehat{A}_{i,j}=\frac{r_{i,j}-\bar{r}_{i}}{\sigma_{r,i}+\varepsilon},\qquad\bar{r}_{i}=\frac{1}{G}\sum_{j=1}^{G}r_{i,j},$$

and optimize

$$\mathcal{L}_{\mathrm{RL}}=-\frac{1}{\sum_{i,j,t}m^{\mathrm{cmp}}_{i,j,t}}\sum_{i,j,t}\Big[\rho_{i,j,t}\widehat{A}_{i,j}-\beta_{\mathrm{KL}}D^{\mathrm{KL}}_{i,j,t}\Big]m^{\mathrm{cmp}}_{i,j,t}.$$

Here $\rho_{i,j,t}$ denotes the usual token-level policy ratio term used in GRPO-style updates, and $D^{\mathrm{KL}}_{i,j,t}$ is the KL regularizer to a frozen reference model $\pi_{\mathrm{ref}}$.

##### KL estimator.
In our implementation, we use the same KL estimator as in the RePO/GRPO training stack used for all baselines and variants. This choice is held fixed across RePO and Active-GRPO so that all comparisons differ only in the adaptive guidance and active reference mechanisms.

### A.2 Rollout Policy, Reference Model, and Synchronization

Rollouts are generated from a rollout policy corresponding to the current training policy at the beginning of each rollout phase. This rollout policy is kept fixed while computing rewards, advantages, and losses for the resulting minibatch, and is then refreshed for the next rollout phase after the model update. This is the standard lagged-policy setup used in policy optimization.

The KL reference model $\pi_{\mathrm{ref}}$ is frozen throughout training. It is not periodically refreshed, and serves only as a stabilizing reference for KL regularization. All methods in our comparisons use the same reference-model treatment.

In the actual implementation, rollout generation is performed through the same inference backend for RePO and Active-GRPO, with identical decoding settings except where explicitly varied in ablations.

### A.3 Memory Bank Semantics and Persistence

For each training instance, Active-GRPO maintains a memory bank

$$\mathcal{B}(c_{i})=\{(m_{\ell},r_{\ell})\}_{\ell=1}^{|\mathcal{B}(c_{i})|},\qquad|\mathcal{B}(c_{i})|\leq K.$$

The bank persists across training steps and across epochs: when the same training example reappears, the previously accumulated bank is reused rather than reinitialized. Algorithmically, this means Active-GRPO performs persistent best-so-far self-distillation rather than purely local within-step imitation.

##### Keying semantics.

The memory bank is indexed by a deterministic identifier of the training instance.
In our experiments, this effectively means one bank per example, rather than matching free-form instructions by textual similarity. This avoids ambiguity arising from paraphrased instructions and makes the active reference mechanism well-defined across repeated visits to the same example.

##### Promotion and eviction.

A generated molecule $\widehat{m}$ is promoted when it exceeds the current reference reward by margin $\delta$ and satisfies the admissibility predicate $Q(\widehat{m};c_{i})$. If the bank is already full, the lowest-reward entry is evicted. This yields a capacity-limited best-so-far memory bank for each example.

##### Age-based eviction.

Unless otherwise stated, the main experiments use persistent capacity-based storage only: entries are removed only through reward-based eviction when the bank is full.

### A.4 Guidance Targets, Canonicalization, and Answer Masking

The active guidance target for instance $i$ is

$$m_{i}^{*}=\arg\max_{m\in\{m_{\mathrm{ref},i}\}\cup\mathcal{B}(c_{i})}R(m;c_{i}).$$

All molecules inserted into the memory bank are canonicalized before storage, and the selected active target is canonicalized again before constructing the teacher-forced answer sequence. This ensures that the guidance loss is applied to a deterministic SMILES representation for each promoted molecule.

Answer-level supervision is implemented by masking only the token span corresponding to the final answer molecule:

$$\mathcal{L}_{\mathrm{guide}}^{(i)}(m)=-\frac{\sum_{t}\log\pi_{\theta}(m_{t}\mid x_{i},m_{<t})\,m^{\mathrm{ans}}_{i,t}}{\sum_{t}m^{\mathrm{ans}}_{i,t}+\varepsilon}.$$

Thus, the reasoning trace is preserved as model-generated context, while supervision is applied only to the answer molecule itself.
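The bank semantics of A.3 and the target selection of A.4 can be sketched as a small data structure. This is an illustrative sketch, not the authors' implementation: the class and method names are hypothetical, the dataset reference is kept outside the capacity-limited entries (following the $\{m_{\mathrm{ref},i}\}\cup\mathcal{B}(c_i)$ form of the target equation), and "current reference reward" is assumed to mean the reward of the current best-so-far target. The `admissible` and `canonicalize` callables stand in for $Q$ and SMILES canonicalization.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class MemoryBank:
    """Capacity-limited best-so-far bank for one training instance (hypothetical API)."""
    ref_molecule: str
    ref_reward: float
    capacity: int = 5      # K, memory-bank capacity per instance
    margin: float = 0.05   # delta, promotion margin
    entries: List[Tuple[str, float]] = field(default_factory=list)

    def target(self) -> Tuple[str, float]:
        """Active guidance target: highest-reward molecule in {reference} U bank."""
        candidates = [(self.ref_molecule, self.ref_reward)] + self.entries
        return max(candidates, key=lambda entry: entry[1])

    def maybe_promote(
        self,
        molecule: str,
        reward: float,
        admissible: Callable[[str], bool],
        canonicalize: Callable[[str], str] = lambda smiles: smiles,
    ) -> bool:
        """Promote a candidate if it beats the current target by the margin and passes Q."""
        _, current_reward = self.target()
        if reward <= current_reward + self.margin or not admissible(molecule):
            return False
        self.entries.append((canonicalize(molecule), reward))
        if len(self.entries) > self.capacity:
            # Reward-based eviction: drop the lowest-reward stored entry.
            worst = min(self.entries, key=lambda entry: entry[1])
            self.entries.remove(worst)
        return True
```

Keeping one such object per deterministic example identifier, and reusing it across epochs, would give the persistent best-so-far behavior described above.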
### A.5 Task-Specific Admissibility Predicate $Q(\widehat{m};c)$

The admissibility predicate $Q(\widehat{m};c)$ is separated from the scalar reward for conceptual and practical reasons. The reward $R(\widehat{m};c)$ ranks candidates among admissible outputs, whereas $Q(\widehat{m};c)$ enforces task-specific hard constraints that define whether a candidate is eligible for promotion into the memory bank. In the molecule editing experiments, $Q(\widehat{m};c)$ is instantiated as a hard minimum-similarity constraint to the input molecule, computed using Tanimoto similarity between Morgan fingerprints. The purpose of this separation is to prevent structurally invalid or semantically off-task candidates from entering the bank merely because they achieve a high scalar reward under one component of $R$. In other task settings, $Q(\widehat{m};c)$ may instead encode scaffold preservation, substructure constraints, synthesizability filters, or other admissibility conditions.

### A.6 Adaptive Guidance Regimes

The adaptive guidance coefficient is

$$\beta_{\mathrm{guide}}(c_{i})=\beta_{\min}+(\beta_{\max}-\beta_{\min})\,\sigma\!\left(-\alpha\bigl(v_{\mathrm{top}}(c_{i})-v_{\mathrm{ref}}(z_{i})\bigr)\right).$$

The value of $\beta_{\min}$ determines the asymptotic behavior of the algorithm. If $\beta_{\min}=0$, then once the policy reliably outperforms the current reference, the algorithm transitions to pure RL in the limit. If $\beta_{\min}>0$, then Active-GRPO retains a residual self-distillation signal toward the current best-so-far target, even in late training. We keep this parameter fixed within each experiment and report its value in the hyperparameter table below.
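The guidance coefficient above is a shifted and scaled sigmoid of the gap between the policy's top-sample value and the reference value. A minimal sketch follows; the function name is hypothetical, and the default arguments are the values listed for the main experiments in the hyperparameter table ($\beta_{\min}=0.3$, $\beta_{\max}=1.5$, $\alpha=3.0$).

```python
import math


def guidance_weight(v_top, v_ref, beta_min=0.3, beta_max=1.5, alpha=3.0):
    """Adaptive guidance coefficient beta_guide for one instance.

    v_top: value of the policy's top-k sampled outputs for this instance.
    v_ref: value of the current reference for this instance.
    """
    gap = v_top - v_ref
    # sigma(-alpha * gap) written as 1 / (1 + exp(alpha * gap)).
    sigmoid = 1.0 / (1.0 + math.exp(alpha * gap))
    return beta_min + (beta_max - beta_min) * sigmoid


# When the policy matches the reference (zero gap), the weight sits at the
# midpoint of [beta_min, beta_max]; it falls toward beta_min as the policy
# pulls ahead and rises toward beta_max when the reference is still better.
print(guidance_weight(0.5, 0.5))
print(guidance_weight(0.9, 0.3) < guidance_weight(0.3, 0.9))
```

With these defaults the zero-gap weight is 0.9, so a logged average above 0.9 suggests the reference still tends to beat the policy's top samples, and an average below it suggests the reverse.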
### A\.7 Sampling and Training Hyperparameters

Unless otherwise noted, all methods compared in the main text share the same backbone model, optimizer, rollout budget, reward function, and initialization\. Active\-GRPO differs from RePO only in the addition of the adaptive guidance coefficient and the active reference update mechanism\. The main hyperparameters introduced by Active\-GRPO are:

- $k$: number of top samples used in $v_{\mathrm{top}}(c)$;
- $\alpha$: sharpness of the sigmoid switching rule;
- $\beta_{\min},\beta_{\max}$: lower and upper bounds for the guidance coefficient;
- $\delta$: promotion margin over the current reference reward;
- $K$: memory\-bank capacity per training instance\.

For reproducibility, we report the exact values used in the main experiments in Table [3](https://arxiv.org/html/2607.00531#A1.T3)\.

Table 3: Main hyperparameters used in the matched Active\-GRPO experiments\.

| Hyperparameter | Value |
| --- | --- |
| *Shared sampling/training parameters* | |
| Number of sampled outputs $G$ | 3 |
| KL coefficient $\beta_{\mathrm{KL}}$ | 0\.04 |
| Sampling temperature \(base / high\) | 0\.9 / 1\.3 |
| Maximum completion length | 1024 |
| *Active\-GRPO\-specific parameters* | |
| Top\-$k$ size for $v_{\mathrm{top}}$ | $k=\lceil 0.33G\rceil=1$ |
| Guidance lower bound $\beta_{\min}$ | 0\.3 |
| Guidance upper bound $\beta_{\max}$ | 1\.5 |
| Sigmoid sharpness $\alpha$ | 3\.0 |
| Promotion margin $\delta$ | 0\.05 |
| Memory\-bank capacity $K$ | 5 |
| Similarity threshold in $Q$ | 0\.2 |

### A\.8 Training Algorithm

Algorithm 1: Active\-GRPO Training

1. Dataset $\mathcal{D}=\{(c_i,m_{\mathrm{ref},i})\}_{i=1}^{N}$, system prompt $p_{\mathrm{sys}}$, policy $\pi_{\theta}$, frozen reference model $\pi_{\mathrm{ref}}$
2. **for** each training step **do**
3. Sample minibatch $\{(c_i,m_{\mathrm{ref},i})\}_{i=1}^{B}$ from $\mathcal{D}$
4. **for** $i=1,\dots,B$ **do**
5. If $\mathcal{B}(c_i)$ is uninitialized, set $\mathcal{B}(c_i)\leftarrow\{(m_{\mathrm{ref},i},v_{\mathrm{ref}}(z_i))\}$
6. Construct prompt $x_i\leftarrow[p_{\mathrm{sys}};c_i]$
7. Sample $G$ outputs $\{o_{i,j}\}_{j=1}^{G}\sim\pi_{\theta}(\cdot\mid x_i)$
8. Extract candidate molecules $\{\widehat{m}_{i,j}\}_{j=1}^{G}$
9. Compute rewards $r_{i,j}\leftarrow R(\widehat{m}_{i,j};c_i)$
10. Compute group\-relative advantages $\widehat{A}_{i,j}$
11. Compute $v_{\mathrm{top}}(c_i)$ from the top\-$k$ rewards
12. Compute $\beta_{\mathrm{guide}}(c_i)\leftarrow\beta_{\min}+(\beta_{\max}-\beta_{\min})\,\sigma\!\left(-\alpha\bigl(v_{\mathrm{top}}(c_i)-v_{\mathrm{ref}}(z_i)\bigr)\right)$
13. **for** $j=1,\dots,G$ **do**
14. **if** $\mathrm{Promote}(\widehat{m}_{i,j}\mid c_i)=1$ **then**
15. Insert $(\mathrm{canon}(\widehat{m}_{i,j}),r_{i,j})$ into $\mathcal{B}(c_i)$, evicting the lowest\-reward entry if necessary
16. **end if**
17. **end for**
18. Set $m_i^{*}$ to the molecule with highest stored reward in $\mathcal{B}(c_i)$
19. Compute answer mask $m^{\mathrm{ans}}_{i,t}$ and guidance loss $\mathcal{L}_{\mathrm{guide}}^{(i)}(m_i^{*})$
20. **end for**
21. Compute minibatch RL loss $\mathcal{L}_{\mathrm{RL}}$
22. Compute total loss $\mathcal{L}_{\text{Active-GRPO}}\leftarrow\mathcal{L}_{\mathrm{RL}}+\frac{1}{B}\sum_{i=1}^{B}\beta_{\mathrm{guide}}(c_i)\,\mathcal{L}_{\mathrm{guide}}^{(i)}(m_i^{*})$
23. Update $\theta$ by gradient descent on $\mathcal{L}_{\text{Active-GRPO}}$
24. **end for**

## Appendix B Related Work

##### LLM Reasoning via SFT and Reinforcement Learning\.

Eliciting structured, multi\-step reasoning from large language models has become a central research thread\. Early work showed that simple prompting strategies such as chain\-of\-thought \[[52](https://arxiv.org/html/2607.00531#bib.bib52),[27](https://arxiv.org/html/2607.00531#bib.bib27)\] can substantially improve performance on tasks requiring deliberation\. Building on this, supervised fine\-tuning on curated reasoning traces \[[56](https://arxiv.org/html/2607.00531#bib.bib56),[54](https://arxiv.org/html/2607.00531#bib.bib54),[36](https://arxiv.org/html/2607.00531#bib.bib36)\] further internalizes step\-by\-step problem solving, while reinforcement learning with verifiable rewards \(RLVR\), which uses programmatic checkers as the reward signal, has driven recent advances in mathematical and code reasoning \[[19](https://arxiv.org/html/2607.00531#bib.bib19),[28](https://arxiv.org/html/2607.00531#bib.bib28),[47](https://arxiv.org/html/2607.00531#bib.bib47)\]\. Group Relative Policy Optimization \(GRPO\) \[[47](https://arxiv.org/html/2607.00531#bib.bib47)\] has in particular become a default RLVR algorithm, replacing PPO’s value network with group\-relative advantage estimation\. However, both paradigms exhibit characteristic failure modes that motivate our work: answer\-only SFT can suppress intermediate reasoning and harm generalization \[[10](https://arxiv.org/html/2607.00531#bib.bib10),[33](https://arxiv.org/html/2607.00531#bib.bib33)\], while RLVR struggles with sparse feedback in domains where most samples fail verification gates \[[2](https://arxiv.org/html/2607.00531#bib.bib2),[43](https://arxiv.org/html/2607.00531#bib.bib43)\]\. This pathology is especially acute in instruction\-based molecular optimization, where tight similarity and validity constraints make zero\-reward batches the norm rather than the exception\.
Active\-GRPO inherits the RLVR formulation but addresses the sparse\-reward pathology through reference guidance that adapts on a per\-instance basis to the policy’s current capability\. ##### LLMs for Scientific Discovery and Molecular Optimization\. A growing body of work applies LLMs to scientific reasoning, ranging from domain\-specialized pretraining\[[51](https://arxiv.org/html/2607.00531#bib.bib51)\]and surveys of LLM\-driven discovery\[[1](https://arxiv.org/html/2607.00531#bib.bib1)\]to autonomous experimental agents in chemistry\[[7](https://arxiv.org/html/2607.00531#bib.bib7)\]and drug discovery\[[49](https://arxiv.org/html/2607.00531#bib.bib49)\]\. Within this landscape,*instruction\-based molecular optimization*has emerged as a uniquely demanding testbed: given a source molecule and a natural\-language instruction, the model must propose a structurally similar yet property\-improved candidate\[[25](https://arxiv.org/html/2607.00531#bib.bib25),[26](https://arxiv.org/html/2607.00531#bib.bib26),[14](https://arxiv.org/html/2607.00531#bib.bib14),[30](https://arxiv.org/html/2607.00531#bib.bib30)\]\. The TOMG\-Bench suite\[[29](https://arxiv.org/html/2607.00531#bib.bib29)\]formalized this task across editing, optimization, and customized\-generation subtasks, and subsequent work has explored prompting\[[30](https://arxiv.org/html/2607.00531#bib.bib30)\], surrogate\-model integration, and tool\-augmented agents\. A separate line of research evaluates molecular generation against multi\-property oracles using benchmarks such as GuacaMol\[[8](https://arxiv.org/html/2607.00531#bib.bib8)\], MOSES\[[41](https://arxiv.org/html/2607.00531#bib.bib41)\], the Therapeutics Data Commons\[[22](https://arxiv.org/html/2607.00531#bib.bib22)\], and PMO\[[15](https://arxiv.org/html/2607.00531#bib.bib15)\]\. 
Unlike these oracle\-driven settings, our work targets the instruction\-conditional regime in which reference molecules are provided alongside the instruction, similarity constraints are tight, and the central challenge is*leveraging*—rather than simply imitating—the supplied references\. This regime is precisely where the static\-reference ceiling we identify becomes consequential: a fixed reference is simultaneously the most natural form of guidance and the most direct source of brittleness when its quality varies across instances\. ##### Reference\-Guided and Demonstration\-Augmented Policy Optimization\. Anchoring reinforcement learning to expert demonstrations has a long history\. Behavior cloning provides a pure imitation baseline, while methods that combine demonstrations with RL—DAPG\[[42](https://arxiv.org/html/2607.00531#bib.bib42)\], DQfD\[[21](https://arxiv.org/html/2607.00531#bib.bib21)\], demonstration\-augmented PPO\[[37](https://arxiv.org/html/2607.00531#bib.bib37)\], and advantage\-weighted regression\[[40](https://arxiv.org/html/2607.00531#bib.bib40)\]—use demonstrations to densify reward signal and stabilize exploration in sparse\-reward settings, while DAgger\[[44](https://arxiv.org/html/2607.00531#bib.bib44)\]addresses distribution shift in pure imitation\. In the LLM setting, Reference\-guided Policy Optimization \(RePO\)\[[32](https://arxiv.org/html/2607.00531#bib.bib32)\]—our most direct prior work—adapts the demonstration\-augmented RL idea to instruction\-based molecular optimization by combining a GRPO\-style RL term with a reference\-guidance term that fixes the reasoning trajectory and supervises only the final answer\. A common assumption underlies all of these methods: demonstrations are treated as*fixed targets*, with the policy implicitly assumed never to surpass them\. 
This assumption is the source of the static\-reference ceiling we characterize in this paper: when references are weak, noisy, or misaligned with the instruction\[[16](https://arxiv.org/html/2607.00531#bib.bib16),[5](https://arxiv.org/html/2607.00531#bib.bib5)\], the imitation signal actively pulls the policy away from better solutions it would otherwise discover\. The closest conceptual precedent for our approach is robust policy improvement\[[35](https://arxiv.org/html/2607.00531#bib.bib35)\], which adaptively blends imitation and RL based on online relative performance against a fixed oracle\. Active\-GRPO departs from this entire family in two key respects, both motivated by our active\-reasoning paradigm: it makes the choice of*when*to imitate per\-instance and policy\-conditional, and—uniquely—it makes*what*to imitate adaptive by replacing the reference itself once the policy surpasses it\. ##### Self\-Improvement and Iterative Refinement in LLMs\. A complementary line of work trains LLMs to improve themselves by generating, filtering, and re\-training on their own outputs\. Expert Iteration\[[3](https://arxiv.org/html/2607.00531#bib.bib3)\]formalized this loop in the RL\-with\-tree\-search setting; STaR\[[56](https://arxiv.org/html/2607.00531#bib.bib56)\]adapted it to reasoning by bootstrapping on self\-generated rationales filtered by answer correctness\. Reinforced Self\-Training \(ReST\)\[[18](https://arxiv.org/html/2607.00531#bib.bib18)\]and ReST\-EM\[[48](https://arxiv.org/html/2607.00531#bib.bib48)\]alternate between sampling, reward\-based filtering, and supervised fine\-tuning on filtered outputs\. RAFT\[[13](https://arxiv.org/html/2607.00531#bib.bib13)\]similarly fine\-tunes on top\-ranked samples under a reward model, and Self\-Rewarding Language Models\[[55](https://arxiv.org/html/2607.00531#bib.bib55)\]let the model serve as its own reward source\. 
These methods share a key insight: the policy’s own best outputs can serve as a stronger training target than fixed external data, provided the filtering signal is reliable\. Active\-GRPO builds on this insight but contributes a distinct mechanism that bridges this line of work with the reference\-guided one above\. Rather than *discarding* external references in favor of self\-generated ones, or *preserving* them as fixed targets, Active\-GRPO *integrates* the two through an explicit per\-instance comparison: when the dataset reference still outperforms the policy’s own samples, it imitates the reference; when its samples surpass the reference, it self\-improves and promotes its own discovery into the imitation target\. This is the operational form of our active\-reasoning paradigm: the policy actively decides *when* to imitate versus reinforce, and continuously upgrades *what* it imitates\. To our knowledge, Active\-GRPO is the first method to apply adaptive, policy\-conditional reference replacement to instruction\-based molecular optimization, where tight similarity constraints and competing property objectives make the choice between imitation and self\-improvement particularly consequential\.

## Appendix C Additional Experimental Details

This appendix provides additional experimental details and extended results for Section [4](https://arxiv.org/html/2607.00531#S4)\. We include the full per\-metric breakdown, statistical tests, headroom\-conditioned analyses for all MolOpt subtasks, computational overhead, qualitative case studies, hyperparameter sensitivity, hard\-example stress tests, and longer\-horizon single\-seed evaluations\.

### C\.1 Full Per\-Metric Results on TOMG\-Bench MolOpt

Table [4](https://arxiv.org/html/2607.00531#A3.T4) reports the full per\-metric breakdown for the main TOMG\-Bench MolOpt results\.
The main text emphasizes SR×Sim as the standard composite metric, while this table separates the success and similarity terms to expose the optimization–preservation trade\-off\.

Table 4: Full per\-metric results on TOMG\-Bench MolOpt\. Trained methods are reported as mean ± standard error over three seeds; zero\-shot is evaluated once\.

| Method | LogP SR↑ | LogP Sim↑ | MR SR↑ | MR Sim↑ | QED SR↑ | QED Sim↑ |
| --- | --- | --- | --- | --- | --- | --- |
| Zero\-shot | 0\.2308 | 0\.7368 | 0\.1666 | 0\.7885 | 0\.1396 | 0\.7980 |
| GRPO\-only | 0\.1405 ± 0\.0150 | 0\.8728 ± 0\.0149 | 0\.1069 ± 0\.0163 | 0\.8974 ± 0\.0152 | 0\.0779 ± 0\.0079 | 0\.8997 ± 0\.0143 |
| Iterative SFT | 0\.2509 ± 0\.0318 | 0\.7493 ± 0\.0127 | 0\.2114 ± 0\.0244 | 0\.8213 ± 0\.0030 | 0\.1539 ± 0\.0130 | 0\.8158 ± 0\.0090 |
| Offline\-strengthened RePO | 0\.2526 ± 0\.0018 | 0\.7814 ± 0\.0291 | 0\.2206 ± 0\.0072 | 0\.8412 ± 0\.0250 | 0\.1336 ± 0\.0172 | 0\.8524 ± 0\.0268 |
| RePO | 0\.2368 ± 0\.0096 | 0\.7925 ± 0\.0023 | 0\.2181 ± 0\.0120 | 0\.8534 ± 0\.0058 | 0\.1501 ± 0\.0044 | 0\.8383 ± 0\.0030 |
| Active\-GRPO | 0\.2613 ± 0\.0139 | 0\.7565 ± 0\.0126 | 0\.2314 ± 0\.0098 | 0\.8221 ± 0\.0104 | 0\.1820 ± 0\.0134 | 0\.7932 ± 0\.0184 |

Table [4](https://arxiv.org/html/2607.00531#A3.T4) shows that Active\-GRPO’s gain comes primarily from higher success rates\. RePO and GRPO\-only often preserve higher similarity, but this similarity can reflect conservative or no\-op behavior rather than successful optimization\. This motivates the qualitative analysis in Section [4\.6](https://arxiv.org/html/2607.00531#S4.SS6)\.

### C\.2 Statistical Significance

We assess statistical significance using a paired bootstrap over per\-example SR×Sim differences\. For each method and subtask, we compute SR×Sim for every evaluation example in each seed, pool the matched per\-example observations across the three seeds, and bootstrap paired indices with 10,000 resamples\. For each resample, we compute the mean difference between Active\-GRPO and the comparison method\.
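The resampling procedure can be sketched in a few lines of standard\-library Python\. This is a minimal illustration under stated assumptions, not the authors' evaluation script: the function name `paired_bootstrap_pvalue` and its arguments are hypothetical, the inputs are assumed to be already\-pooled, index\-matched per\-example SR×Sim scores, and the returned $p$\-value is taken to be the fraction of resamples whose mean difference is non\-positive\.

```python
import random
from typing import Sequence, Tuple


def paired_bootstrap_pvalue(
    ours: Sequence[float],
    baseline: Sequence[float],
    n_resamples: int = 10_000,
    seed: int = 0,
) -> Tuple[float, float]:
    """Paired bootstrap over matched per-example scores.

    ours[i] and baseline[i] must refer to the same evaluation example
    (and seed). Returns (observed mean difference, one-sided p-value),
    where the p-value is the fraction of resamples with mean diff <= 0.
    """
    if len(ours) != len(baseline) or len(ours) == 0:
        raise ValueError("inputs must be non-empty and of equal length")

    diffs = [a - b for a, b in zip(ours, baseline)]
    n = len(diffs)
    rng = random.Random(seed)

    non_positive = 0
    for _ in range(n_resamples):
        # Resample paired indices with replacement; the sign of the sum
        # equals the sign of the mean, so the division can be skipped.
        total = 0.0
        for _ in range(n):
            total += diffs[rng.randrange(n)]
        if total <= 0.0:
            non_positive += 1

    observed = sum(diffs) / n
    return observed, non_positive / n_resamples
```

Resampling indices rather than the two score lists independently is what keeps the test paired: each draw carries both methods' scores for the same example\.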
The reported one\-sided $p$\-value is the fraction of bootstrap resamples for which this mean difference is non\-positive\. Table [5](https://arxiv.org/html/2607.00531#A3.T5) reports the resulting SR×Sim differences and $p$\-values\.

Table 5: Paired\-bootstrap significance tests for SR×Sim\. Differences are Active\-GRPO minus the comparison method\. Bootstrap resampling is performed over matched per\-example observations pooled across the three seeds\.

| Comparison | LogP $\Delta$ | LogP $p$ | MR $\Delta$ | MR $p$ | QED $\Delta$ | QED $p$ |
| --- | --- | --- | --- | --- | --- | --- |
| Active\-GRPO vs RePO | \+0\.0148 | $<0.001$ | \+0\.0090 | $<0.001$ | \+0\.0206 | $<0.001$ |
| Active\-GRPO vs Offline\-strengthened RePO | \+0\.0115 | $<0.001$ | \+0\.0126 | $<0.001$ | \+0\.0357 | $<0.001$ |
| Active\-GRPO vs Iterative SFT | \+0\.0105 | $<0.001$ | \+0\.0168 | $<0.001$ | \+0\.0204 | $<0.001$ |
| Active\-GRPO vs GRPO\-only | \+0\.0895 | $<0.001$ | \+0\.0953 | $<0.001$ | \+0\.0792 | $<0.001$ |
| Active\-GRPO vs $\beta$\-only | \+0\.0169 | $<0.001$ | \+0\.0142 | $<0.001$ | \+0\.0235 | $<0.001$ |
| Active\-GRPO vs bank\-only | \+0\.0113 | $<0.001$ | \+0\.0134 | $<0.001$ | \+0\.0188 | $<0.001$ |

### C\.3 Headroom\-Conditional Analysis on All Subtasks

Section [4\.4](https://arxiv.org/html/2607.00531#S4.SS4) of the main text reports the headroom\-conditioned analysis, with the LogP breakdown shown alongside the curve\. Here we provide the complete protocol and the corresponding extended analysis for all three MolOpt subtasks\. Tables [6](https://arxiv.org/html/2607.00531#A3.T6), [7](https://arxiv.org/html/2607.00531#A3.T7), and [8](https://arxiv.org/html/2607.00531#A3.T8) report the per\-bin results for LogP, MR, and QED, respectively\. For each test example, we compute an optimization\-headroom score from the source molecule’s original property value\. For decrease\-LogP and decrease\-MR tasks, larger original LogP/MR values indicate more room for improvement\. For increase\-QED, we define headroom as $1-\mathrm{QED}$\.
We then partition test examples into quintiles and report the Active\-GRPO–RePO gap within each bin\. These bins measure optimization headroom rather than an independent reference\-quality gap; in this benchmark, the source molecule is the static anchor against which optimization is requested\.

Table 6: Headroom\-conditioned analysis on LogP\. The Active\-GRPO–RePO gap increases with optimization headroom\.

| Quintile | Gap range | $\Delta$SR | Active\-GRPO SR×Sim | RePO SR×Sim | $\Delta$SR×Sim |
| --- | --- | --- | --- | --- | --- |
| Q1 | $(-3.91,1.30]$ | \+0\.007 | 0\.168 | 0\.167 | \+0\.002 |
| Q2 | $(1.30,2.25]$ | \+0\.014 | 0\.173 | 0\.160 | \+0\.007 |
| Q3 | $(2.25,2.94]$ | \+0\.024 | 0\.175 | 0\.159 | \+0\.013 |
| Q4 | $(2.94,3.63]$ | \+0\.032 | 0\.183 | 0\.161 | \+0\.022 |
| Q5 | $(3.63,7.36]$ | \+0\.045 | 0\.189 | 0\.157 | \+0\.031 |

Table 7: Headroom\-conditioned analysis on MR\. We bucket test examples by the original MR value of the source molecule; larger values indicate more room for improvement under the decrease\-MR objective\.

| Quintile | Gap range | $\Delta$SR | Active\-GRPO SR×Sim | RePO SR×Sim | $\Delta$SR×Sim |
| --- | --- | --- | --- | --- | --- |
| Q1 | $(19.97,74.51]$ | \+0\.013 | 0\.137 | 0\.129 | \+0\.008 |
| Q2 | $(74.51,84.89]$ | \+0\.003 | 0\.152 | 0\.149 | \+0\.003 |
| Q3 | $(84.89,93.59]$ | \+0\.014 | 0\.158 | 0\.148 | \+0\.010 |
| Q4 | $(93.59,102.92]$ | \+0\.014 | 0\.164 | 0\.156 | \+0\.008 |
| Q5 | $(102.92,142.84]$ | \+0\.022 | 0\.176 | 0\.161 | \+0\.015 |

Table 8: Headroom\-conditioned analysis on QED\. We bucket test examples by $1-\mathrm{QED}$ of the source molecule; larger values indicate more room for improvement under the increase\-QED objective\.

| Quintile | Gap range | $\Delta$SR | Active\-GRPO SR×Sim | RePO SR×Sim | $\Delta$SR×Sim |
| --- | --- | --- | --- | --- | --- |
| Q1 | $(0.05,0.15]$ | \+0\.041 | 0\.129 | 0\.102 | \+0\.027 |
| Q2 | $(0.15,0.20]$ | \+0\.016 | 0\.118 | 0\.109 | \+0\.009 |
| Q3 | $(0.20,0.27]$ | \+0\.035 | 0\.129 | 0\.107 | \+0\.022 |
| Q4 | $(0.27,0.38]$ | \+0\.020 | 0\.113 | 0\.102 | \+0\.012 |
| Q5 | $(0.38,0.82]$ | \+0\.047 | 0\.130 | 0\.096 | \+0\.034 |

Because the test split does not provide a separate gold reference molecule for each example, we do not construct a separate reference\-quality stratification; source\-property headroom is the relevant task\-level proxy for how informative the static source\-reference is for the requested edit\.

### C\.4 Computational Overhead

Table [9](https://arxiv.org/html/2607.00531#A3.T9) reports matched wall\-clock measurements for RePO and Active\-GRPO\. Both methods use the same model architecture and rollout setup; the additional computations in Active\-GRPO, including dynamic guidance weighting and memory\-bank operations, are lightweight relative to online generation and model optimization\.

Table 9: Wall\-clock comparison between RePO and Active\-GRPO\. Timing is measured over matched 30\-step runs on 4×A100\-SXM4\-40GB with DeepSpeed ZeRO\-3\.

| Method | Total time | Sec\./step | Samples/sec\. | Overhead |
| --- | --- | --- | --- | --- |
| RePO | 4720\.8s | 157\.4 | 0\.61 | baseline |
| Active\-GRPO | 4657\.5s | 155\.3 | 0\.62 | −1\.3% |

### C\.5 Qualitative Case Studies

Figure [5](https://arxiv.org/html/2607.00531#A3.F5) shows representative examples where Active\-GRPO succeeds and RePO fails across LogP, MR, and QED tasks\. Each row contains the source molecule, the RePO output, and the Active\-GRPO output, together with similarity and property\-change annotations\.

Figure 5: Qualitative case studies across LogP, MR, and QED\.
Active\-GRPO makes targeted structural edits that satisfy the requested property objective while preserving much of the input structure\. RePO often either copies the source molecule, yielding high similarity but no property improvement, or produces an invalid or wrong\-direction edit\.

##### No\-op failure analysis\.

To quantify the copy\-like failure mode observed in the qualitative examples, we measure the no\-op failure rate:

$$\mathrm{NoOpFail}=\mathbf{1}\{\mathrm{Sim}(\widehat{m},m_{\mathrm{src}})\geq 0.98\;\wedge\;\mathrm{Success}=0\}.$$

Figure [6](https://arxiv.org/html/2607.00531#A3.F6) reports this rate by method and subtask\. Across MolOpt subtasks, Active\-GRPO reduces no\-op failures relative to RePO, supporting the qualitative observation that active imitation and RL with active referencing encourages more effective edits rather than simply preserving the source molecule\.

Figure 6: No\-op failure rate by method and subtask\. A no\-op failure is defined as an output with high similarity to the source molecule, $\mathrm{Sim}(\widehat{m},m_{\mathrm{src}})\geq 0.98$, that nevertheless fails the requested property objective\. Lower is better\.

##### Visible reasoning traces\.

We further inspect visible rationales emitted before the final SMILES answer\. These are model\-generated text outputs, not hidden chain\-of\-thought\. The examples below show a recurring pattern: RePO often states a plausible chemical strategy, but the final extracted SMILES is unchanged, invalid, or inconsistent with the stated objective\. In contrast, Active\-GRPO more often converts the stated optimization direction into a valid targeted edit\.
Example 1: Increasing LogP via hydrophobic substitution

Instruction\. Optimize FC\(F\)\(F\)c1cccc\(Nc2scc\(\-c3ccc\(Cl\)cc3\)\[n\+\]2CCc2ccccc2\)c1 to have a higher LogP value\.

| | Active\-GRPO | RePO |
| --- | --- | --- |
| Output | FC\(Cl\)\(Cl\)c1cccc\(\.\.\.\)c1 | FC\(F\)\(F\)c1cccc\(\.\.\.C\(=O\)\[OH\]\.\.\.\)c1 |
| Eval\. | Success, sim = 0\.873 | Failure, sim = 0\.780 |
| Visible rationale | “replace the fluorine atoms with chlorine atoms” to increase hydrophobicity | states that a larger hydrophobic substituent should be introduced, but returns a carboxylic\-acid\-containing edit |

Interpretation\. Active\-GRPO’s rationale and edit are directionally aligned with the target property\. RePO identifies the need for higher hydrophobicity, but the generated molecule introduces a polar acid group and fails the objective\.

Example 2: Decreasing LogP by adding polarity

Instruction\. Modify N\#Cc1cc\(F\)ccc1NC\(=O\)c1ccnc\(OCC\(F\)F\)c1 to decrease its LogP value\.

| | Active\-GRPO | RePO |
| --- | --- | --- |
| Output | N\#Cc1cc\(F\)ccc1NC\(=O\)c1ccnc\(OCC\(F\)O\)c1 | \\\(N\#Cc1cc\(F\)ccc1NC\(=O\)c1ccnc\(OCC\(F\)F\)c1\\\) |
| Eval\. | Success, sim = 0\.865 | Invalid, sim = 0\.000 |
| Visible rationale | introduces a hydroxyl group to increase hydrophilicity and reduce LogP | describes removing/replacing fluorine atoms, but the final extracted answer retains formatting delimiters and is invalid |

Interpretation\. Active\-GRPO maps the property direction to a valid polar substitution\. RePO’s rationale is broadly plausible, but the final answer is not a valid molecule\.
Example 3: Decreasing MR by simplifying a ring system

Instruction\. Modify CCN\(CC\)S\(=O\)\(=O\)c1ccc\(OC\)c\(C\(=O\)N2CCC3\(CCCCC3\)CC2\)c1 to decrease its MR value\.

| | Active\-GRPO | RePO |
| --- | --- | --- |
| Output | CCN\(CC\)S\(=O\)\(=O\)c1ccc\(OC\)c\(C\(=O\)N2CCC3\(CCC3\)CC2\)c1 | CCN\(CC\)S\(=O\)\(=O\)c1ccc\(OC\)c\(C\(=O\)N2CCC3\(CCCCC3\)CC2\)c1 |
| Eval\. | Success, sim = 0\.940 | Failure, sim = 1\.000 |
| Visible rationale | reduces the bulky fused ring component to lower MR | states that the ring should be simplified, but returns the original molecule |

Interpretation\. This example illustrates a no\-op failure\. RePO’s natural\-language strategy is reasonable, but the final SMILES is unchanged\. Active\-GRPO makes a conservative structural simplification that satisfies the target\.

Example 4: Lowering QED while avoiding a no\-op

Instruction\. Optimize CCc1nc2ccc\(C\(=O\)N\(C\)\[C@@H\]\(C\)c3nc4ccccc4s3\)cc2nc1CC to have a lower QED value\.

| | Active\-GRPO | RePO |
| --- | --- | --- |
| Output | CCc1nc2ccc\(C\(=O\)N\(C\)\[C@@H\]\(C\)c3nc4ccccc4s3\)cc2nc1CCS | CCc1nc2ccc\(C\(=O\)N\(C\)\[C@@H\]\(C\)c3nc4ccccc4s3\)cc2nc1CC |
| Eval\. | Success, sim = 0\.889 | Failure, sim = 1\.000 |
| Visible rationale | proposes a sulfur\-containing modification and returns a non\-trivial edit | describes simplifying the molecule, but returns the original SMILES |

Interpretation\. Active\-GRPO preserves high similarity while changing the molecule enough to alter the property\. RePO again produces a no\-op despite describing an intended edit\.

Overall, these examples suggest that Active\-GRPO improves not merely by changing output distributions, but by making the final molecular action more consistent with the model’s stated property\-level reasoning\. This qualitative pattern matches the quantitative no\-op analysis: RePO more frequently emits unchanged or invalid molecules on examples where a small targeted edit is required\.
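The no\-op failure indicator used in this section is simple enough to state as code\. The sketch below is illustrative only: `EvalRecord`, `is_noop_failure`, and `noop_failure_rate` are hypothetical names, and the similarity and success fields are assumed to come from an existing evaluation pipeline rather than being computed here\.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class EvalRecord:
    """One evaluated output (hypothetical container for this sketch)."""
    similarity: float  # similarity between the output and the source molecule
    success: bool      # whether the requested property objective was met


def is_noop_failure(rec: EvalRecord, sim_threshold: float = 0.98) -> bool:
    """Near-copy of the source (Sim >= threshold) that fails the objective."""
    return rec.similarity >= sim_threshold and not rec.success


def noop_failure_rate(records: Iterable[EvalRecord]) -> float:
    """Fraction of records that are no-op failures; 0.0 for an empty input."""
    records = list(records)
    if not records:
        return 0.0
    return sum(is_noop_failure(r) for r in records) / len(records)


if __name__ == "__main__":
    # RePO outcomes from Examples 1-4 above (similarity, success).
    repo = [
        EvalRecord(0.780, False),
        EvalRecord(0.000, False),
        EvalRecord(1.000, False),
        EvalRecord(1.000, False),
    ]
    print(noop_failure_rate(repo))
```

Invalid outputs \(similarity 0\) and wrong\-direction edits do not count as no\-ops under this definition; only the unchanged outputs in Examples 3 and 4 do\.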
### C\.6 Hyperparameter Sensitivity

We evaluate the sensitivity of Active\-GRPO to three key hyperparameters: the sigmoid sharpness $\alpha$, the memory\-bank capacity $K$, and the promotion margin $\delta$\. We use MolOpt\-LogP as the test bed and vary one parameter at a time while holding the others at their default values: $\alpha=3.0$, $K=5$, and $\delta=0.05$\. Table [10](https://arxiv.org/html/2607.00531#A3.T10) reports the single\-seed sweep results\.

Table 10: Hyperparameter sensitivity on MolOpt\-LogP\. We vary one parameter at a time around the default setting $\alpha=3.0$, $K=5$, $\delta=0.05$\.

| Configuration | SR↑ | Sim↑ | SR×Sim↑ |
| --- | --- | --- | --- |
| $\alpha=1.0$ | 0\.2350 | 0\.7911 | 0\.1859 |
| $\alpha=3.0$ \(default\) | 0\.2613 | 0\.7565 | 0\.1977 |
| $\alpha=5.0$ | 0\.2390 | 0\.8004 | 0\.1913 |
| $\alpha=10.0$ | 0\.2476 | 0\.7851 | 0\.1944 |
| $K=1$ | 0\.2336 | 0\.8207 | 0\.1917 |
| $K=3$ | 0\.2112 | 0\.8194 | 0\.1731 |
| $K=5$ \(default\) | 0\.2613 | 0\.7565 | 0\.1977 |
| $K=10$ | 0\.2372 | 0\.7974 | 0\.1891 |
| $\delta=0.01$ | 0\.2522 | 0\.8050 | 0\.2030 |
| $\delta=0.05$ \(default\) | 0\.2613 | 0\.7565 | 0\.1977 |
| $\delta=0.10$ | 0\.2244 | 0\.8198 | 0\.1840 |
| $\delta=0.20$ | 0\.2534 | 0\.8012 | 0\.2030 |

The sweep suggests that Active\-GRPO does not depend on any single carefully tuned hyperparameter\. Across all configurations, SR×Sim remains in the range 0\.1731–0\.2030\. The default setting is near the top of the sweep, and two alternative promotion margins \($\delta=0.01$ and $\delta=0.20$\) slightly exceed it \(0\.2030 versus 0\.1977\)\. The lowest score occurs at $K=3$, but even this setting remains close to the main RePO baseline, indicating that the method is reasonably robust across plausible hyperparameter choices\.

### C\.7 Hard\-Example Stress Test

We further evaluate RePO and Active\-GRPO on a hard\-example subset derived from ZINC, containing molecules whose initial properties leave substantially larger room for optimization\. Table [11](https://arxiv.org/html/2607.00531#A3.T11) reports results on this subset\.
This stress test probes whether active imitation and RL with active referencing remains beneficial when successful edits are rarer and the optimization problem is more difficult\.

Table 11: Hard\-example stress test on a ZINC\-derived difficult subset\. Results are mean ± standard error over three seeds\.

| Subtask | Method | SR↑ | Sim↑ | SR×Sim↑ |
| --- | --- | --- | --- | --- |
| LogP | RePO | 0\.1011 ± 0\.0097 | 0\.9207 ± 0\.0037 | 0\.0930 ± 0\.0085 |
| LogP | Active\-GRPO | 0\.1204 ± 0\.0134 | 0\.8859 ± 0\.0242 | 0\.1061 ± 0\.0089 |
| MR | RePO | 0\.0757 ± 0\.0091 | 0\.9479 ± 0\.0061 | 0\.0716 ± 0\.0081 |
| MR | Active\-GRPO | 0\.0807 ± 0\.0115 | 0\.9257 ± 0\.0209 | 0\.0742 ± 0\.0089 |
| QED | RePO | 0\.0468 ± 0\.0069 | 0\.9538 ± 0\.0058 | 0\.0446 ± 0\.0063 |
| QED | Active\-GRPO | 0\.0545 ± 0\.0142 | 0\.9299 ± 0\.0329 | 0\.0504 ± 0\.0113 |

On this hard subset, both methods become more conservative: success rates are substantially lower than on the standard MolOpt split, while similarities remain high\. Nevertheless, Active\-GRPO improves SR and SR×Sim over RePO on all three subtasks, suggesting that adaptive guidance remains useful when the optimization problem is more difficult\.

### C\.8 Longer\-Horizon Single\-Seed Evaluation

We additionally evaluate RePO and Active\-GRPO in a longer\-horizon single\-seed setting using Qwen2\.5\-3B\-Instruct on A100\-80GB GPUs\. We train for 4 epochs on OpenMolIns\-light, with 1500 training examples per task group, and evaluate on TOMG\-Bench using Success Rate \(SR\), Similarity \(Sim\), and SR×Sim\. Table [12](https://arxiv.org/html/2607.00531#A3.T12) reports the complete results\. To evaluate transfer across related objectives, we train one shared policy for the property\-based objectives \(LogP, MR, and QED\) and one shared policy for the structure\-based objectives \(AddComponent, DelComponent, and SubComponent\), rather than training a separate policy for each objective\. We report this evaluation as complementary evidence to the main multi\-seed comparison\.
Table 12: Longer\-horizon single\-seed evaluation on TOMG\-Bench\. We train one shared policy for structure\-based objectives and one shared policy for property\-based objectives\. We report Success Rate \(SR\), Similarity \(Sim\), and SR×Sim; higher is better\.

| Task type | Objective | Metric | Zero\-shot | RePO | Active\-GRPO |
| --- | --- | --- | --- | --- | --- |
| Structure\-based optimization | AddComponent | SR | 0\.123 | 0\.408 | 0\.458 |
| | | Sim | 0\.573 | 0\.722 | 0\.718 |
| | | SR×Sim | 0\.071 | 0\.295 | 0\.329 |
| | DelComponent | SR | 0\.250 | 0\.339 | 0\.389 |
| | | Sim | 0\.601 | 0\.752 | 0\.754 |
| | | SR×Sim | 0\.150 | 0\.255 | 0\.293 |
| | SubComponent | SR | 0\.151 | 0\.502 | 0\.600 |
| | | Sim | 0\.657 | 0\.752 | 0\.760 |
| | | SR×Sim | 0\.099 | 0\.377 | 0\.456 |
| | Avg | SR×Sim | 0\.107 | 0\.309 | 0\.359 |
| Property\-based optimization | LogP | SR | 0\.309 | 0\.443 | 0\.669 |
| | | Sim | 0\.628 | 0\.711 | 0\.609 |
| | | SR×Sim | 0\.194 | 0\.315 | 0\.408 |
| | MR | SR | 0\.249 | 0\.505 | 0\.574 |
| | | Sim | 0\.630 | 0\.709 | 0\.600 |
| | | SR×Sim | 0\.157 | 0\.358 | 0\.345 |
| | QED | SR | 0\.222 | 0\.348 | 0\.330 |
| | | Sim | 0\.613 | 0\.722 | 0\.635 |
| | | SR×Sim | 0\.136 | 0\.251 | 0\.210 |
| | Avg | SR×Sim | 0\.162 | 0\.308 | 0\.321 |

The longer\-horizon evaluation also reveals a limitation: under the shared property\-policy setting, Active\-GRPO underperforms RePO in SR×Sim on MR \(0\.345 versus 0\.358\) and on QED \(0\.210 versus 0\.251\)\. Since LogP, MR, and QED are optimized by a single shared policy in this experiment, per\-objective trade\-offs may differ from those obtained under separate per\-property training\. At the same time, Active\-GRPO achieves the best average SR×Sim across the three property objectives in this longer\-horizon setting\. Because this evaluation is single\-seed, we use it as complementary evidence about transfer across related objectives, while relying on the main multi\-seed MolOpt comparison for the primary statistical claim\.

### C\.9 Matched MolEdit Structural Optimization

We further evaluate RePO and Active\-GRPO on TOMG\-Bench MolEdit, which tests structure\-based molecular editing through AddComponent, DelComponent, and SubComponent objectives\.
Both methods are trained under the same matched 4×A100\-40GB configuration for three seeds, using the same backbone, rollout budget, reward functions, decoding setup, and evaluation pipeline\. We train one shared structure\-editing policy over the three MolEdit objectives and evaluate each subtask separately\. Table [13](https://arxiv.org/html/2607.00531#A3.T13) reports the full SR, Sim, and SR×Sim breakdown\.

Table 13: Matched 3\-seed MolEdit structural\-optimization results\. We train one shared structure\-editing policy over AddComponent, DelComponent, and SubComponent, and report mean ± standard error over three seeds\.

| Subtask | Method | SR↑ | Sim↑ | SR×Sim↑ |
| --- | --- | --- | --- | --- |
| AddComponent | RePO | 0\.2003 ± 0\.0176 | 0\.7579 ± 0\.0054 | 0\.1517 ± 0\.0125 |
| AddComponent | Active\-GRPO | 0\.2282 ± 0\.0104 | 0\.7515 ± 0\.0128 | 0\.1716 ± 0\.0098 |
| DelComponent | RePO | 0\.1393 ± 0\.0064 | 0\.8661 ± 0\.0106 | 0\.1205 ± 0\.0041 |
| DelComponent | Active\-GRPO | 0\.1142 ± 0\.0184 | 0\.8927 ± 0\.0219 | 0\.1012 ± 0\.0137 |
| SubComponent | RePO | 0\.3067 ± 0\.0046 | 0\.7901 ± 0\.0027 | 0\.2423 ± 0\.0042 |
| SubComponent | Active\-GRPO | 0\.3245 ± 0\.0151 | 0\.8056 ± 0\.0102 | 0\.2615 ± 0\.0139 |
| Avg | RePO | 0\.2154 | 0\.8047 | 0\.1715 |
| Avg | Active\-GRPO | 0\.2223 | 0\.8166 | 0\.1781 |