Controllable Molecular Generative Foundation Models

arXiv cs.LG Papers

Summary

Proposes CoMole, a controllable molecular generative foundation model using motif-aware graph diffusion and reinforcement learning, achieving superior controllability across materials and drug discovery benchmarks.

arXiv:2605.15354v1 Announce Type: new

Cached at: 05/18/26, 06:40 AM

# Controllable Molecular Generative Foundation Models
Source: [https://arxiv.org/html/2605.15354](https://arxiv.org/html/2605.15354)
Yihan Zhu, University of Notre Dame, yzhu25@nd.edu; Yuhan Liu, University of Notre Dame, yliu57@nd.edu; Weijiang Li, University of Notre Dame, wli27@nd.edu; Tengfei Luo, University of Notre Dame, tluo@nd.edu; Meng Jiang, University of Notre Dame, mjiang2@nd.edu

###### Abstract

Despite the success of foundation models in language and vision, molecular graph generation still lacks a unified framework for heterogeneous design tasks with reliable controllability. While reinforcement learning (RL) offers a natural post-training mechanism for task-specific optimization, applying it to graph generative models is hindered by the vast atom-wise action spaces and chemically invalid intermediate states. We propose **Co**ntrollable **Mole**cular Generative Foundation Models (CoMole), built with a unified motif-aware graph diffusion pipeline. By learning a motif-aware graph space, CoMole transfers pretrained structural priors into controllable generation, where RL optimizes conditional reverse policies over chemically meaningful decisions. We theoretically characterize the bottleneck of atom-level RL and justify motif-aware policy optimization. Across three heterogeneous benchmarks spanning materials and drug discovery, CoMole ranks first in controllability on all nine targets, reduces MAE by up to 48.2% relative to the strongest baselines, and maintains validity above 0.94 without rule-based correction or post-hoc filtering. We further show that CoMole transfers controllability to unseen properties by optimizing only task embeddings with the generator frozen, achieving performance competitive with strong task-specific baselines.

## 1 Introduction

Molecular inverse design, the task of generating structures with desired functional properties, is a central challenge in scientific discovery, with applications spanning biomedicine and materials science. Recent graph diffusion methods have significantly advanced molecular generation (Vignac et al., [2023](https://arxiv.org/html/2605.15354#bib.bib1); Huang et al., [2023](https://arxiv.org/html/2605.15354#bib.bib2); Liu et al., [2024a](https://arxiv.org/html/2605.15354#bib.bib3)). However, these approaches remain largely task-specific and offer limited controllability over target properties (Qin et al., [2025](https://arxiv.org/html/2605.15354#bib.bib10); Liu et al., [2025](https://arxiv.org/html/2605.15354#bib.bib6)). In language and vision, foundation-model paradigms have enabled powerful generative systems through large-scale pretraining (PT), supervised fine-tuning (SFT), and reinforcement learning (RL)-based alignment (Bommasani et al., [2022](https://arxiv.org/html/2605.15354#bib.bib31); Zhang et al., [2023](https://arxiv.org/html/2605.15354#bib.bib32)). For molecular inverse design, the abundance of unlabeled chemical data alongside label-scarce downstream tasks motivates the development of unified molecular generative foundation models to bridge the controllability gap between current methods and practical inverse-design needs.

However, instantiating this paradigm with atom-level graph diffusion exposes a core bottleneck: beyond the trajectory collapse observed in atom-level diffusion RL (Dulac-Arnold et al., [2015](https://arxiv.org/html/2605.15354#bib.bib36); Liu et al., [2024b](https://arxiv.org/html/2605.15354#bib.bib7)), the vast space of low-level atom-wise actions is poorly aligned with the chemically meaningful substructures through which chemists typically formulate structure-property rationales for molecular design. As illustrated in [Figure 1](https://arxiv.org/html/2605.15354#S1.F1), atom-level RL often fails to construct reliable substructures from local edits before it can associate structural changes with property rewards. This fragility arises because each action must jointly coordinate atom types, bonds, and valence constraints. A single invalid choice can push the trajectory off the feasible chemical manifold and make later denoising difficult to recover. For example, the atom-level variant of our design collapses rule-free validity from 0.95 to 0.07 and worsens gas-permeability control MAE by 2.68× ([Table 4](https://arxiv.org/html/2605.15354#S4.T4)).

To overcome this bottleneck, we reparameterize graph-diffusion RL with a motif-aware decision space. It preserves local structural flexibility while introducing rings, functional groups, and data-driven motifs as higher-level decisions. Since molecular properties depend on substructures and their complex interactions, this space lifts RL from fragile atom-wise construction paths to chemically meaningful decisions (e.g., attaching a benzene ring), allowing rewards on generated structures to be credited more directly to property-relevant actions. This abstraction also promotes chemical validity by preserving internally coherent substructures and reducing the risk of trajectory corruption from invalid atom-wise edits. Beyond using motifs mainly as generation priors (Jin et al., [2019](https://arxiv.org/html/2605.15354#bib.bib4); Kong et al., [2022](https://arxiv.org/html/2605.15354#bib.bib5)), our design uses them to stabilize RL optimization toward controllable generation.

We introduce CoMole, to our knowledge the first family of Controllable Molecular Generative Foundation Models for heterogeneous inverse design. The central idea is to transfer pretrained structural knowledge into controllable inverse design by optimizing conditional reverse policies via RL over chemically meaningful decisions. Concretely, CoMole learns a Node Pair Encoding (NPE)-based tokenizer (Liu et al., [2025](https://arxiv.org/html/2605.15354#bib.bib6)) that preserves singleton atoms and attachment-level information while merging frequent adjacent units from the pretraining distribution into motif-aware graph states. Over this space, the graph diffusion transformer is trained through three stages: PT learns transferable structural priors, SFT introduces conditioning for multiple properties, and RL aligns the conditional reverse diffusion policy with terminal target-property rewards. In the main text, we instantiate RL alignment with proximal policy optimization (PPO), while alternative policy-optimization objectives are discussed in Appendix [D.4.2](https://arxiv.org/html/2605.15354#A4.SS4.SSS2).

We theoretically characterize the structural bottleneck of atom\-level RL and motivate motif\-aware policy optimization\. Empirically, we evaluate CoMole on three materials and drug\-discovery benchmarks spanning numerical and categorical conditions\. Across all nine targets, CoMole ranks first in controllability: on both polymer benchmarks, it reduces MAE by over 44% relative to the best baseline\. On small\-molecule tasks, it reduces FreeSolv MAE by 13\.1% and achieves 1\.0 accuracy on BACE classification\. These gains are achieved while maintaining validity above 0\.94 without rule\-based correction or post\-hoc filtering\. We further show that CoMole transfers controllability to unseen property targets by learning only task embeddings while keeping the generator frozen, achieving performance competitive with baselines trained directly on those targets\. Together, these results suggest that our design learns transferable structure\-property knowledge that supports controllable generation across heterogeneous inverse\-design tasks\.

![Refer to caption](https://arxiv.org/html/2605.15354v1/x1.png)

Figure 1: Motif-aware RL as a key stage in training controllable molecular generative foundation models. Atom-level RL over vast, low-level graph edits suffers trajectory collapse and fragile credit assignment, whereas motif-aware RL credits terminal rewards to chemically meaningful decisions, stabilizing policy updates.

## 2 Preliminaries

##### Notation\.

Let $x \in \mathcal{M}_{\mathrm{val}}$ be a chemically valid molecular graph and $\phi$ a learned graph tokenizer. The tokenizer maps $x$ to a motif graph $z_0 = \phi(x) \in \mathcal{Z}_0$, which serves as the motif state for diffusion generation and RL post-training. For conditional generation, $c = (k, y^{\star}) \in \mathcal{C}$ denotes the target condition, with task identity $k$ and target value or label $y^{\star}$. We use $\mathsf{D}$ to denote the empirical training distribution.

### 2.1 Motif Graph Tokenization

We follow the Node Pair Encoding (NPE) algorithm introduced by DemoDiff (Liu et al., [2025](https://arxiv.org/html/2605.15354#bib.bib6)) and learn a vocabulary from the pretraining dataset (Appendix [C.2](https://arxiv.org/html/2605.15354#A3.SS2)). The nodes of the resulting motif graph are atom-disjoint vocabulary units, including singleton atoms, preserved ring units, and data-driven merged substructures as higher-level motifs. The graph also stores inter-motif bond labels and directional attachment-position labels for lossless reconstruction. Since motif graphs have variable sizes, we pad them to a fixed number of motif slots and represent each state as $z = (X, E, P, m)$, where $m_i = 1$ indicates that slot $i$ is active. The variable $X_i$ encodes the motif type at slot $i$. For active motif pairs, $E_{ij} = E_{ji}$ denotes a symmetric categorical bond label, including a no-bond category. $P_{ij}$ denotes the directional attachment-position label on motif $i$ toward motif $j$. Directed pairs without an attachment use a null attachment-position label. The mask $m$ is sampled once from a size prior and fixed throughout generation (see Appendix [A.2](https://arxiv.org/html/2605.15354#A1.SS2)). When no confusion arises, we omit $m$ and write $z = (X, E, P)$.
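As a concrete illustration of the padded state $z = (X, E, P, m)$, the following minimal sketch builds one state for a toy three-motif chain. The class name `MotifState`, the helper `build_state`, the integer codes (0 for padding, no-bond, and null attachment), and the slot count are assumptions made for illustration; they are not taken from the paper or its code.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

PAD = 0       # assumed code for an inactive motif slot
NO_BOND = 0   # assumed code for the no-bond category
NULL_POS = 0  # assumed code for the null attachment-position label


@dataclass
class MotifState:
    X: List[int]        # motif type per slot
    E: List[List[int]]  # symmetric bond labels, E[i][j] == E[j][i]
    P: List[List[int]]  # directed attachment positions, P[i][j] is on motif i
    m: List[int]        # 1 if the slot is active, else 0


def build_state(
    motif_types: List[int],
    bonds: Dict[Tuple[int, int], Tuple[int, int, int]],
    num_slots: int,
) -> MotifState:
    """Pad a motif graph to num_slots; bonds maps (i, j) to (label, pos_on_i, pos_on_j)."""
    n = len(motif_types)
    if n > num_slots:
        raise ValueError("more motifs than slots")
    X = list(motif_types) + [PAD] * (num_slots - n)
    m = [1] * n + [0] * (num_slots - n)
    E = [[NO_BOND] * num_slots for _ in range(num_slots)]
    P = [[NULL_POS] * num_slots for _ in range(num_slots)]
    for (i, j), (label, pos_i, pos_j) in bonds.items():
        E[i][j] = E[j][i] = label  # bond label is symmetric
        P[i][j] = pos_i            # attachment position on motif i toward j
        P[j][i] = pos_j            # attachment position on motif j toward i
    return MotifState(X, E, P, m)


# Toy chain of three motifs (hypothetical vocabulary ids 5, 12, 7).
state = build_state([5, 12, 7], {(0, 1): (1, 2, 1), (1, 2): (1, 3, 1)}, num_slots=5)
```

The bond matrix is symmetric while the attachment matrix is directed, which mirrors the distinction between $E_{ij}$ and $P_{ij}$ in the text.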

### 2.2 Molecular Design with Graph Diffusion Transformers

Given a motif graph $z_0 = (X_0, E_0, P_0)$, we perform discrete diffusion over motif states $z_t = (X_t, E_t, P_t)$. The forward noising process is

$$q(z_{1:T} \mid z_0) = \prod_{t=1}^{T} q(z_t \mid z_{t-1}),$$

which progressively corrupts motif types, bond types, and attachment-position labels. Let $q_t(z_t \mid z_0)$ denote the induced marginal at step $t$. The reverse process starts from the prior $p(z_T)$ and is parameterized by a graph diffusion transformer:

$$p_{\theta}(z_{0:T} \mid c) = p(z_T) \prod_{t=1}^{T} \widetilde{p}_{\theta}(z_{t-1} \mid z_t, t, c).$$

For unconditional pretraining, $c$ is omitted. At each reverse step, the denoiser predicts the distributions

$$(\hat{X}_0, \hat{E}_0, \hat{P}_0) = f_{\theta}(z_t, t, c),$$

which parameterize $\widetilde{p}_{\theta}(z_{t-1} \mid z_t, t, c)$ by first predicting the motif state $z_0 = (X_0, E_0, P_0)$. The model is trained with a masked denoising objective:

$$\mathcal{L}_{\mathrm{diff}}(\theta) = \mathbb{E}_{(z_0, c) \sim \mathsf{D},\ t \sim \mathrm{Unif}([T]),\ z_t \sim q_t(\cdot \mid z_0)} \left[ \lambda_X \mathrm{CE}_X + \lambda_E \mathrm{CE}_E + \lambda_P \mathrm{CE}_P \right].$$

Here $\mathrm{CE}_X$, $\mathrm{CE}_E$, and $\mathrm{CE}_P$ are masked cross-entropy losses over active motif nodes, inter-motif edges, and directed motif pairs. Details are given in Appendix [A.3](https://arxiv.org/html/2605.15354#A1.SS3).

## 3 Controllable Molecular Generative Foundation Modeling

Following [Section 2](https://arxiv.org/html/2605.15354#S2.SS0.SSS0.Px1), CoMole generates molecules by reversing a task-conditioned motif-graph diffusion process. [Section 3.1](https://arxiv.org/html/2605.15354#S3.SS1) formulates PPO over the reverse trajectories, and [Section 3.2](https://arxiv.org/html/2605.15354#S3.SS2) analyzes why our motif-aware design improves over atom-level RL.

### 3.1 Learning the Reverse Diffusion Process as a Policy

Given a condition $c$, reverse diffusion generates a graph by denoising $z_T$ into $z_0$. Since intermediate noisy graphs are difficult to evaluate reliably with property oracles, we apply rewards only on the terminal graph. We therefore formulate reverse diffusion as a finite-horizon terminal-reward Markov decision process (MDP) and optimize the denoising policy over sampled trajectories.

##### MDP Formulation\.

We treat the reverse process as an MDP with horizon $H = T$. Given a target condition $c$, at MDP step $h = 0, \dots, T-1$, the state $s_h = (z_{T-h}, T-h, c)$ contains the current noisy motif graph, timestep, and condition, and the action is the next reverse state $a_h = z_{T-h-1}$. During RL training, each rollout draws a condition $c \sim \mu_{\mathrm{RL}}$ and samples $z_T \sim p(z_T)$, starting from $s_0 = (z_T, T, c)$ and ending at $s_T = (z_0, 0, c)$. After the policy samples $a_h$, the next state is deterministically updated to $s_{h+1} = (a_h, T-h-1, c)$, so the only stochastic decision is the one-step reverse kernel:

$$\pi_{\theta}(a_h \mid s_h) = \widetilde{p}_{\theta}(z_{T-h-1} \mid z_{T-h}, T-h, c). \tag{1}$$

Although $a_h$ is a structured motif-graph action, $\pi_{\theta}(a_h \mid s_h)$ factorizes over motif, bond, and attachment-position variables, allowing exact log-probability computation for RL. The explicit factorization is given in Appendix [A.4](https://arxiv.org/html/2605.15354#A1.SS4).
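The MDP bookkeeping above can be sketched as a short rollout loop. This is a minimal illustration, not the paper's implementation: `rollout`, `sample_prior`, and `reverse_kernel` are hypothetical names, and the toy kernel at the bottom stands in for the learned reverse kernel only to make the indexing concrete.

```python
from typing import Any, Callable, List, Tuple

State = Tuple[Any, int, Any]  # (z_{T-h}, T-h, c)


def rollout(
    sample_prior: Callable[[], Any],
    reverse_kernel: Callable[[Any, int, Any], Any],
    condition: Any,
    T: int,
) -> Tuple[List[Tuple[State, Any]], State]:
    """Collect (s_h, a_h) pairs for h = 0..T-1 and return the terminal state s_T."""
    z = sample_prior()                        # z_T ~ p(z_T)
    trajectory = []
    for h in range(T):
        t = T - h
        state = (z, t, condition)             # s_h = (z_{T-h}, T-h, c)
        action = reverse_kernel(z, t, condition)  # a_h = z_{T-h-1}
        trajectory.append((state, action))
        z = action                            # deterministic transition to s_{h+1}
    return trajectory, (z, 0, condition)      # s_T = (z_0, 0, c)


# Toy stand-ins: integers play the role of motif graphs.
toy_trajectory, toy_terminal = rollout(lambda: 3, lambda z, t, c: z - 1, "c", T=3)
```

Only the call to `reverse_kernel` is stochastic in the real setting; the state update itself is a deterministic copy of the sampled action, as stated in the text.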

##### Terminal Molecular Reward\.

Let $x_{\mathrm{gen}} = \mathrm{Dec}(z_0)$ be the molecule decoded from the final reverse state. For task $k$, let $\hat{o}_k(x)$ be the oracle output and define the target discrepancy $d_c(x) := \ell_k(\hat{o}_k(x), y^{\star})$ for valid molecules. Here $\ell_k$ is task-specific, e.g., absolute error for regression or absolute probability-label difference for binary classification.

We use a terminal reward that combines validity and target satisfaction:

$$R(z_0; c) = w_{\mathrm{val}}\, r_{\mathrm{val}}(x_{\mathrm{gen}}) + (1 - w_{\mathrm{val}})\, r_{\mathrm{prop}}(x_{\mathrm{gen}}; c), \qquad w_{\mathrm{val}} \in [0, 1]. \tag{2}$$

Here

$$r_{\mathrm{val}}(x) = \begin{cases} 1, & x \in \mathcal{M}_{\mathrm{val}}, \\ -1, & x \notin \mathcal{M}_{\mathrm{val}}, \end{cases} \qquad r_{\mathrm{prop}}(x; c) = \begin{cases} g_k(d_c(x)), & x \in \mathcal{M}_{\mathrm{val}}, \\ 0, & x \notin \mathcal{M}_{\mathrm{val}}. \end{cases} \tag{3}$$

For regression tasks, we use $g_k(d) = \exp[-(d/\sigma_k)^2]$, with $\sigma_k > 0$. For binary classification tasks, we set $g_k(d) = 1 - d$, so $r_{\mathrm{prop}}$ equals the oracle probability assigned to the target label. Thus $R(z_0; c) \in [-w_{\mathrm{val}}, 1]$.
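Equations (2) and (3) translate almost line by line into code. The sketch below is illustrative only: the function name `terminal_reward`, its argument names, and the numeric values of `w_val` and `sigma` in the examples are assumptions, and validity and the discrepancy $d_c(x)$ are passed in rather than computed by a real decoder or oracle.

```python
import math


def terminal_reward(is_valid, discrepancy, w_val, task_type, sigma=1.0):
    """Terminal reward R(z_0; c) for one decoded molecule, following Eqs. (2)-(3).

    discrepancy is d_c(x): absolute error for regression, or the absolute
    probability-label difference for binary classification.
    """
    if not 0.0 <= w_val <= 1.0:
        raise ValueError("w_val must lie in [0, 1]")
    if not is_valid:
        # r_val = -1 and r_prop = 0 for invalid molecules, so R = -w_val.
        return -w_val
    if task_type == "regression":
        r_prop = math.exp(-((discrepancy / sigma) ** 2))  # g_k(d) = exp[-(d/sigma)^2]
    elif task_type == "classification":
        r_prop = 1.0 - discrepancy                        # g_k(d) = 1 - d
    else:
        raise ValueError("task_type must be 'regression' or 'classification'")
    return w_val * 1.0 + (1.0 - w_val) * r_prop


# Hypothetical settings chosen only to exercise each branch.
r_invalid = terminal_reward(False, 0.3, w_val=0.2, task_type="regression")
r_perfect = terminal_reward(True, 0.0, w_val=0.2, task_type="regression")
r_class = terminal_reward(True, 0.25, w_val=0.5, task_type="classification")
```

Note that a valid molecule always scores at least `w_val`, while an invalid one scores exactly `-w_val`, so the achievable rewards lie in $\{-w_{\mathrm{val}}\} \cup [w_{\mathrm{val}}, 1]$, a subset of the stated range.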

##### Policy Optimization\.

Given the rollout distribution $\mu_{\mathrm{RL}}$, the RL objective is

$$J(\pi_{\theta}) = \mathbb{E}_{c \sim \mu_{\mathrm{RL}}} \mathbb{E}_{\tau = (z_T, \ldots, z_0) \sim p_{\pi_{\theta}}(\cdot \mid c)} \big[ R(z_0; c) \big]. \tag{4}$$

We optimize ([4](https://arxiv.org/html/2605.15354#S3.E4)) with PPO initialized from the SFT checkpoint. Because the reward is terminal-only and $\gamma = 1$, every reverse step in a rollout has return $R(z_0; c)$. At each MDP step $h$, we use an MLP value head to estimate the state value $V_{\psi}(s_h)$. The advantage is $A_h = R(z_0; c) - V_{\psi}(s_h)$.

For on-policy rollouts collected by $\pi_{\theta_{\mathrm{old}}}$, the PPO importance ratio is

$$\rho_h = \frac{\pi_{\theta}(a_h \mid s_h)}{\pi_{\theta_{\mathrm{old}}}(a_h \mid s_h)}. \tag{5}$$

We maximize the clipped PPO surrogate

$$\mathcal{J}_{\mathrm{clip}}(\theta) = \mathbb{E}\!\left[ \sum_{h=0}^{T-1} \min\!\Big( \rho_h A_h,\; \mathrm{clip}(\rho_h, 1-\epsilon, 1+\epsilon)\, A_h \Big) \right]. \tag{6}$$

In implementation, we optimize the PPO loss with critic loss, entropy bonus, and KL regularization. Details are given in Appendix [A.5](https://arxiv.org/html/2605.15354#A1.SS5).
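For a single rollout, the clipped surrogate in Eq. (6) with the terminal-only advantage can be sketched as follows. The function name `clipped_surrogate`, the argument layout, and the default `eps=0.2` are assumptions for illustration (0.2 is a common PPO default, not a value stated in this section); the critic loss, entropy bonus, and KL term mentioned in the text are omitted.

```python
import math


def clipped_surrogate(log_probs_new, log_probs_old, values, terminal_reward, eps=0.2):
    """Sum over reverse steps of min(rho_h * A_h, clip(rho_h) * A_h) for one rollout."""
    total = 0.0
    for lp_new, lp_old, v in zip(log_probs_new, log_probs_old, values):
        advantage = terminal_reward - v          # A_h = R(z_0; c) - V(s_h), gamma = 1
        rho = math.exp(lp_new - lp_old)          # importance ratio from log-probabilities
        clipped = min(max(rho, 1.0 - eps), 1.0 + eps)
        total += min(rho * advantage, clipped * advantage)
    return total


# On-policy case: rho = 1 at every step, so the surrogate is the sum of advantages.
on_policy = clipped_surrogate([-1.0, -2.0], [-1.0, -2.0], [0.5, 0.25], terminal_reward=1.0)

# Large ratio with positive advantage: the update is capped at (1 + eps) * A_h.
capped = clipped_surrogate([0.0], [-1.0], [0.0], terminal_reward=1.0)
```

Because the policy factorizes over motif, bond, and attachment variables, each per-step log-probability would in practice be a sum of categorical log-probabilities; here they are supplied directly as numbers.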

### 3.2 Theoretical Analysis

We analyze the RL stage and the motif-aware decision space. Let $\pi_{\mathrm{ref}}$ denote the frozen SFT reference policy used for KL regularization, $\mathcal{A}(s)$ the finite reverse action space at state $s$, and $p_{\pi}(\tau \mid c)$ the trajectory distribution induced by policy $\pi$ under condition $c$.

#### 3.2.1 The Role of RL in Condition Control

As in [Section 3.1](https://arxiv.org/html/2605.15354#S3.SS1), reverse diffusion defines a finite-horizon MDP with terminal reward $R(z_0; c)$ on the final clean motif graph $z_0$. Following KL-regularized control and control-as-inference formulations (Todorov, [2006](https://arxiv.org/html/2605.15354#bib.bib11); Levine, [2018](https://arxiv.org/html/2605.15354#bib.bib29)), for a fixed condition $c$ we consider the population objective

$$\mathcal{J}_{\beta}(\pi; c) = \mathbb{E}_{\tau = (z_T, \ldots, z_0) \sim p_{\pi}(\cdot \mid c)} \left[ R(z_0; c) - \beta \sum_{h=0}^{T-1} \mathrm{KL}\!\left( \pi(\cdot \mid s_h) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid s_h) \right) \right], \quad \beta > 0. \tag{7}$$

This objective is a population-level analytical reference, while PPO approximates the corresponding policy improvement in practice using sampled rollouts and a learned critic.

###### Proposition 3.1 (Regularized Bellman characterization).

For a fixed condition $c$, assume the policy $\pi$ is supported within the support of $\pi_{\mathrm{ref}}$, ensuring the KL term is finite. Let $V_h^{\star}(s)$ denote the optimal KL-regularized value from MDP step $h$. At the terminal step, for $s_T = (z_0, 0, c)$, $V_T^{\star}(s_T) = R(z_0; c)$. For $h = T-1, \dots, 0$, define

$$Q_h^{\star}(s, a) = V_{h+1}^{\star}(s'), \qquad s' = (a, T-h-1, c), \tag{8}$$

$$V_h^{\star}(s) = \beta \log \sum_{a \in \mathcal{A}(s)} \pi_{\mathrm{ref}}(a \mid s) \exp\!\left( \frac{Q_h^{\star}(s, a)}{\beta} \right). \tag{9}$$

Then the maximizer of [Eq. 7](https://arxiv.org/html/2605.15354#S3.E7) is

$$\pi_h^{\star}(a \mid s) = \pi_{\mathrm{ref}}(a \mid s) \exp\!\left( \frac{Q_h^{\star}(s, a) - V_h^{\star}(s)}{\beta} \right), \tag{10}$$

with support contained in that of the reference policy, i.e., $\pi_h^{\star}(a \mid s) = 0$ whenever $\pi_{\mathrm{ref}}(a \mid s) = 0$. The optimal value is

$$\mathbb{E}_{z_T \sim p(\cdot)} \left[ V_0^{\star}\big( (z_T, T, c) \big) \right].$$

The optimal policy is an exponential reweighting of the reference reverse kernel toward actions with larger downstream value $Q_h^{\star}$. For any two actions $a, a'$ with positive reference probability,

$$\frac{\pi_h^{\star}(a \mid s) / \pi_{\mathrm{ref}}(a \mid s)}{\pi_h^{\star}(a' \mid s) / \pi_{\mathrm{ref}}(a' \mid s)} = \exp\!\left( \frac{Q_h^{\star}(s, a) - Q_h^{\star}(s, a')}{\beta} \right). \tag{11}$$

Thus, actions with higher downstream value are amplified relative to $\pi_{\mathrm{ref}}$. The proof follows by applying a standard Gibbs variational identity at each state; details are given in Appendix [B.1](https://arxiv.org/html/2605.15354#A2.SS1).
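A one-state numerical check makes Eqs. (9) to (11) concrete. The sketch below is illustrative only: the function names `soft_value` and `optimal_policy` and the toy values of `pi_ref`, `q`, and `beta` are assumptions, not quantities from the paper.

```python
import math


def soft_value(pi_ref, q, beta):
    """V = beta * log sum_a pi_ref(a) * exp(Q(a) / beta), with a max-shift for stability."""
    q_max = max(q)
    z = sum(p * math.exp((qa - q_max) / beta) for p, qa in zip(pi_ref, q))
    return q_max + beta * math.log(z)


def optimal_policy(pi_ref, q, beta):
    """pi*(a) = pi_ref(a) * exp((Q(a) - V) / beta), as in Eq. (10)."""
    v = soft_value(pi_ref, q, beta)
    return [p * math.exp((qa - v) / beta) for p, qa in zip(pi_ref, q)]


# Toy state with four actions; the last has zero reference support.
pi_ref = [0.7, 0.2, 0.1, 0.0]
q = [0.1, 0.9, 0.5, 1.0]
beta = 0.25
pi_star = optimal_policy(pi_ref, q, beta)

# Ratio of reweighting factors for actions 1 and 0, compared with Eq. (11).
lhs = (pi_star[1] / pi_ref[1]) / (pi_star[0] / pi_ref[0])
rhs = math.exp((q[1] - q[0]) / beta)
```

The action with the highest $Q$ value but zero reference probability stays at zero, which is the support restriction stated in the proposition.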

###### Corollary 3.2 (Downstream-value amplification).

Fix a state $s$, a reverse step $h$, and a subset of actions $\mathcal{G} \subseteq \mathcal{A}(s)$. Suppose there exist $b \in \mathbb{R}$ and $\Delta > 0$ such that

$$Q_h^{\star}(s, a) \geq b + \Delta \quad \text{for all } a \in \mathcal{G}, \qquad Q_h^{\star}(s, a) \leq b \quad \text{for all } a \notin \mathcal{G}.$$

Let $p_G := \pi_{\mathrm{ref}}(\mathcal{G} \mid s)$. Then

$$\pi_h^{\star}(\mathcal{G} \mid s) \geq \frac{e^{\Delta/\beta}\, p_G}{e^{\Delta/\beta}\, p_G + 1 - p_G}. \tag{12}$$

In particular, high-value actions with nonzero reference support receive amplified probability mass, while unsupported actions remain unsupported.
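The bound can be checked directly from Eqs. (9) and (10). The following is a short derivation sketch added for illustration; the shorthand $A$ and $B$ is introduced here and does not appear in the paper.

$$
\begin{aligned}
A &:= \sum_{a \in \mathcal{G}} \pi_{\mathrm{ref}}(a \mid s)\, e^{Q_h^{\star}(s,a)/\beta} \;\geq\; e^{(b+\Delta)/\beta}\, p_G, \\
B &:= \sum_{a \notin \mathcal{G}} \pi_{\mathrm{ref}}(a \mid s)\, e^{Q_h^{\star}(s,a)/\beta} \;\leq\; e^{b/\beta}\, (1 - p_G), \\
\pi_h^{\star}(\mathcal{G} \mid s) &= \frac{A}{A + B} \;\geq\; \frac{e^{(b+\Delta)/\beta}\, p_G}{e^{(b+\Delta)/\beta}\, p_G + e^{b/\beta}\, (1 - p_G)} \;=\; \frac{e^{\Delta/\beta}\, p_G}{e^{\Delta/\beta}\, p_G + 1 - p_G}.
\end{aligned}
$$

The first equality in the last line holds because $e^{V_h^{\star}(s)/\beta} = A + B$ by Eq. (9). The inequality uses that $A/(A+B)$ is nondecreasing in $A$ and nonincreasing in $B$ for nonnegative $A$ and $B$ with $A + B > 0$.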

This provides a mechanism by which RL can improve target controllability: KL-regularized optimization amplifies high-value denoising choices while staying close to the SFT reference policy. This amplification requires target-relevant actions to have nonzero support under $\pi_{\mathrm{ref}}$. The readiness analysis in Appendix [D.4.3](https://arxiv.org/html/2605.15354#A4.SS4.SSS3) empirically probes this support through repeated sampling from the SFT checkpoint of CoMole.

#### 3.2.2 Decision Complexity of Atom-level and Motif-aware RL

The reverse action $a_h = z_{T-h-1}$ is a structured graph-valued object. For a representation $r \in \{\mathrm{atom}, \mathrm{motif}\}$, write the one-step action as $a = (a_1, \dots, a_{M_r(s)})$, where $M_r(s)$ is the number of stochastic categorical factors in the reverse kernel at state $s$. For motif graphs, these factors include motif types, inter-motif bonds, and directed attachment positions. Assume the policy and reference policy factorize as

$$\pi^{(r)}(a \mid s) = \prod_{j=1}^{M_r(s)} \pi^{(r)}_{j}(a_j \mid s, a_{<j}), \qquad \pi^{(r)}_{\mathrm{ref}}(a \mid s) = \prod_{j=1}^{M_r(s)} \pi^{(r)}_{\mathrm{ref},j}(a_j \mid s, a_{<j}). \tag{13}$$

###### Proposition 3.3 (Factorized one-step KL).

Under [Eq. 13](https://arxiv.org/html/2605.15354#S3.E13),

$$\mathrm{KL}\!\left( \pi^{(r)}(\cdot \mid s) \,\|\, \pi^{(r)}_{\mathrm{ref}}(\cdot \mid s) \right) = \sum_{j=1}^{M_r(s)} \mathbb{E}_{a_{<j} \sim \pi^{(r)}}\!\left[ \mathrm{KL}\!\left( \pi^{(r)}_{j}(\cdot \mid s, a_{<j}) \,\|\, \pi^{(r)}_{\mathrm{ref},j}(\cdot \mid s, a_{<j}) \right) \right]. \tag{14}$$

###### Corollary 3.4 (KL budget per subdecision).

If $\mathrm{KL}\!\left( \pi^{(r)}(\cdot \mid s) \,\|\, \pi^{(r)}_{\mathrm{ref}}(\cdot \mid s) \right) \leq \eta$, then

$$\frac{1}{M_r(s)} \sum_{j=1}^{M_r(s)} \mathbb{E}_{a_{<j} \sim \pi^{(r)}}\!\left[ \mathrm{KL}\!\left( \pi^{(r)}_{j}(\cdot \mid s, a_{<j}) \,\|\, \pi^{(r)}_{\mathrm{ref},j}(\cdot \mid s, a_{<j}) \right) \right] \leq \frac{\eta}{M_r(s)}. \tag{15}$$

Therefore, under the same reference-KL regularization, a larger $M_r(s)$ leaves less average policy change per categorical factor. Since atom-level representations induce larger $M_r(s)$ through atom and atom-pair decisions, terminal-reward policy improvement becomes more difficult.
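The chain rule in Eq. (14) and the per-factor budget in Eq. (15) can be verified numerically on a tiny example. The sketch below uses two binary factors with made-up probabilities; all names and numbers are assumptions chosen for illustration.

```python
import math


def kl(p, q):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


# Factor 1 is a marginal; factor 2 is conditional on the value of factor 1.
pi_1 = [0.6, 0.4]
pi_2 = {0: [0.9, 0.1], 1: [0.3, 0.7]}
ref_1 = [0.5, 0.5]
ref_2 = {0: [0.8, 0.2], 1: [0.5, 0.5]}

# Joint distributions over (a_1, a_2) built from the autoregressive factors.
joint_pi = [pi_1[a1] * pi_2[a1][a2] for a1 in (0, 1) for a2 in (0, 1)]
joint_ref = [ref_1[a1] * ref_2[a1][a2] for a1 in (0, 1) for a2 in (0, 1)]

joint_kl = kl(joint_pi, joint_ref)

# Right-hand side of Eq. (14): per-factor KLs, with factor 2 averaged over a_1 ~ pi.
factor_sum = kl(pi_1, ref_1) + sum(pi_1[a1] * kl(pi_2[a1], ref_2[a1]) for a1 in (0, 1))

# With M = 2 factors, the average per-factor KL is the joint KL divided by 2.
avg_per_factor = joint_kl / 2
```

If the joint KL is held at a fixed budget while the number of factors grows, the average per-factor term shrinks proportionally, which is the point of Corollary 3.4.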

We next compare this decision size under atom-only and motif-aware representations. For a molecule $x$, let $n_{\mathrm{atom}}(x)$ and $n_{\mathrm{motif}}(x)$ denote its numbers of active atoms and active motifs. Since motifs partition the atom set, $1 \leq n_{\mathrm{motif}}(x) \leq n_{\mathrm{atom}}(x)$. To compare one-step reverse decision sizes, define

$$L_{\mathrm{atom}}(n) := n + \binom{n}{2} = \frac{n^2 + n}{2}, \qquad L_{\mathrm{motif}}(n) := n + \binom{n}{2} + n(n-1) = \frac{3n^2 - n}{2}. \tag{16}$$

Here $L_{\mathrm{atom}}$ counts atom-type and atom-bond decisions in a standard graph diffusion model, while $L_{\mathrm{motif}}$ counts motif-type, inter-motif bond, and directed attachment-position decisions in the motif graph.

###### Corollary 3.5 (Motif-aware reduction of reverse decision size).

Let $\chi(x) := n_{\mathrm{atom}}(x) / n_{\mathrm{motif}}(x)$ be the atom-to-motif compression ratio. Then

$$\frac{L_{\mathrm{motif}}(n_{\mathrm{motif}}(x))}{L_{\mathrm{atom}}(n_{\mathrm{atom}}(x))} = \frac{3 n_{\mathrm{motif}}(x)^2 - n_{\mathrm{motif}}(x)}{n_{\mathrm{atom}}(x)^2 + n_{\mathrm{atom}}(x)} \leq \frac{3}{\chi(x)^2}. \tag{17}$$

Thus, sufficiently large motif compression yields a substantially smaller one-step reverse decision size despite the additional attachment-position channel.

Combining Corollary [3.4](https://arxiv.org/html/2605.15354#S3.Thmtheorem4) and Corollary [3.5](https://arxiv.org/html/2605.15354#S3.Thmtheorem5), motif-aware actions improve the KL-regularized optimization geometry by reducing the number of categorical decisions that must be adjusted at each denoising step. In our implementation, the tokenizer yields an average atom-to-motif compression ratio above $5.5\times$ for polymers (see Appendix [C.2](https://arxiv.org/html/2605.15354#A3.SS2)), implying an upper-bound factor-count ratio of roughly $3/5.5^2 \approx 0.10$. The same tokenizer configuration yields weaker compression on small molecules, which is consistent with a relatively modest performance gain on FreeSolv in [Table 3](https://arxiv.org/html/2605.15354#S4.T3).
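The counts in Eq. (16) and the bound in Eq. (17) are easy to evaluate for a specific size. In the sketch below, the function names and the example sizes (55 atoms and 10 motifs, giving a compression ratio of exactly 5.5) are hypothetical and chosen only to match the ratio quoted above.

```python
def l_atom(n: int) -> int:
    """Atom-type plus atom-pair bond decisions: n + C(n, 2)."""
    return n + n * (n - 1) // 2


def l_motif(n: int) -> int:
    """Motif-type, inter-motif bond, and directed attachment decisions."""
    return n + n * (n - 1) // 2 + n * (n - 1)


def size_ratio(n_atom: int, n_motif: int) -> float:
    """Ratio L_motif(n_motif) / L_atom(n_atom) from Eq. (17)."""
    return l_motif(n_motif) / l_atom(n_atom)


# Hypothetical polymer with 55 atoms tokenized into 10 motifs (chi = 5.5).
n_atom, n_motif = 55, 10
chi = n_atom / n_motif
ratio = size_ratio(n_atom, n_motif)
bound = 3.0 / chi ** 2
```

For these sizes the motif graph has 145 categorical factors against 1540 for the atom graph, so the exact ratio sits slightly below the $3/\chi^2$ bound.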

## 4 Experiments

RQ1: We validate the controllable generative power of CoMole against baselines from molecular optimization, generative modeling, and RL-based graph generation in [Section 4.2](https://arxiv.org/html/2605.15354#S4.SS2). RQ2: We conduct further analysis to examine CoMole in [Section 4.3](https://arxiv.org/html/2605.15354#S4.SS3).

### 4.1 Experimental Setup

We evaluate CoMole on three heterogeneous drug and material design task sets, covering both numerical and categorical properties. For each task set, CoMole is trained jointly across all tasks. We also include CoMole$_{\text{w/o RL}}$, an ablated variant without RL post-training, to isolate the effect of RL. We assess performance across seven metrics covering validity, distribution learning, and condition control. Full dataset statistics and implementation details are provided in Appendix [C.1](https://arxiv.org/html/2605.15354#A3.SS1) and Appendix [D](https://arxiv.org/html/2605.15354#A4).

##### Datasets

For PT, we construct two unlabeled datasets: 13k real polymers for materials modeling and 10k molecules sampled from MoleculeNet (Wu et al., [2018](https://arxiv.org/html/2605.15354#bib.bib13)) for drug-related tasks. For SFT, RL, and evaluation, we consider three task sets: (1) numerical polymer gas permeability conditions (O2Perm, CO2Perm, N2Perm) (Thornton et al., [2012](https://arxiv.org/html/2605.15354#bib.bib14)); (2) numerical polymer density-functional-theory (DFT) properties (Eea, Egb, Egc) (Xu et al., [2023](https://arxiv.org/html/2605.15354#bib.bib15)); (3) a drug-related task set spanning physical chemistry (FreeSolv regression), biophysics (BACE classification), and physiology (BBBP classification) (Gao et al., [2022a](https://arxiv.org/html/2605.15354#bib.bib17)). Two additional DFT properties, Ei and EPS, are held out from training for unseen-target generalization. We separate gas permeability and DFT properties for polymers since they represent distinct property families. A six-target polymer joint-training analysis is provided in Appendix [D.4.1](https://arxiv.org/html/2605.15354#A4.SS4.SSS1).

##### Evaluation

We use an 8:1:1 train/validation/test split and evaluate conditional generation on held-out test conditions. For each setting, we generate 10,000 samples and report: (1) molecular validity (Validity); (2) internal diversity among the generated examples (Diversity); (3) fragment-based similarity with the reference set (Similarity); (4) Fréchet ChemNet Distance with the reference set (Distance) (Preuer et al., [2018](https://arxiv.org/html/2605.15354#bib.bib27)); (5)-(7) MAE/Accuracy for the numerical/categorical task conditions (Property). Lower MAE or higher accuracy indicates stronger model controllability. We use Random Forest oracles for drug tasks (Gao et al., [2022a](https://arxiv.org/html/2605.15354#bib.bib17)) and GRIN for polymer tasks (Zhu et al., [2026](https://arxiv.org/html/2605.15354#bib.bib16)). Oracle analysis is provided in Appendix [D.4.4](https://arxiv.org/html/2605.15354#A4.SS4.SSS4). We additionally report novelty and uniqueness in Appendix [D.4.5](https://arxiv.org/html/2605.15354#A4.SS4.SSS5).

##### Baselines

We compare against three families of baselines: (1) molecular optimization methods, including Graph-GA (Jensen, [2019](https://arxiv.org/html/2605.15354#bib.bib18)), MARS (Xie et al., [2021](https://arxiv.org/html/2605.15354#bib.bib19)), JTVAE with Bayesian optimization (JTVAE-BO) (Jin et al., [2018](https://arxiv.org/html/2605.15354#bib.bib20)), and SMILES-LSTM with hill climbing (LSTM-HC) (Brown et al., [2019](https://arxiv.org/html/2605.15354#bib.bib21)); (2) graph generative models, including GDSS (Jo et al., [2022](https://arxiv.org/html/2605.15354#bib.bib22)), DiGress (Vignac et al., [2023](https://arxiv.org/html/2605.15354#bib.bib1)), MOOD (Lee et al., [2023](https://arxiv.org/html/2605.15354#bib.bib23)), Graph DiT (Liu et al., [2024a](https://arxiv.org/html/2605.15354#bib.bib3)), and DeFoG (Qin et al., [2025](https://arxiv.org/html/2605.15354#bib.bib10)); (3) RL-based graph generation methods, including FREED (Yang et al., [2021](https://arxiv.org/html/2605.15354#bib.bib9)) and GDPO (Liu et al., [2024b](https://arxiv.org/html/2605.15354#bib.bib7)). Baselines are trained in their standard task-specialized setting. For task-set-level distribution metrics, we report the strongest specialized baseline.

### 4.2 RQ1: Heterogeneous Conditional Molecular Generation

As shown in [Tables 1](https://arxiv.org/html/2605.15354#S4.T1), [2](https://arxiv.org/html/2605.15354#S4.T2) and [3](https://arxiv.org/html/2605.15354#S4.T3), CoMole achieves the best controllability on all nine targets and maintains at least 0.94 validity even without rule checking.

Table 1: Heterogeneous Conditional Molecular Generation of 10K Polymers: Results on three numerical properties (gas permeability for O2, N2, CO2). MAE is calculated between the input conditions and the properties of the generated polymers using oracles. Best results are in red.

Table 2: Heterogeneous Conditional Molecular Generation of 10K Polymers: Results on three numerical properties (Eea, Egb, Egc). MAE is calculated between the input conditions and the generated properties. Best results are in red.

Table 3: Heterogeneous Conditional Molecular Generation of 10K Molecules: Results on one numerical property (FreeSolv) and two categorical properties (BACE, BBBP). MAE/Accuracy is calculated between the input conditions and the generated properties. Avg. Rank is obtained by ranking each method per metric and averaging across property metrics. Best results are in red.

##### Chemical Validity

Reported validity can overstate generative quality when it relies on hard-coded chemical rules. For example, graph-search methods such as Graph-GA explicitly discard invalid molecules during mutation and crossover, thereby achieving perfect final validity. When rule checking is removed from the decoding step, DiGress and MOOD suffer substantial validity drops (e.g., 0.98 to 0.25). Graph DiT is more robust, maintaining validity above 0.6, but still remains below rule-filtered search methods. GDPO achieves only 0.2965 validity on the gas-permeability task set, further highlighting the fragility of atom-level RL. In contrast, both CoMole and $\textnormal{CoMole}_{\text{w/o RL}}$ maintain validity above 0.9 across all three datasets without rule-based checking, suggesting that the motif-aware graph space preserves chemical feasibility under both SFT and RL alignment.

##### Distribution Learning

Graph-GA and Graph DiT are strong in-distribution baselines, whereas several models such as GDSS and MOOD struggle to match the reference distribution. $\textnormal{CoMole}_{\text{w/o RL}}$ is highly competitive, achieving the best fragment similarity on both polymer tasks (0.9630 and 0.9686) and the best Fréchet distance on the DFT task set. RL post-training shifts generation toward target-aligned regions, mildly on the polymer task sets but more substantially on the drug task set, where FreeSolv, BACE, and BBBP involve distinct physical-chemistry, target-bioactivity, and physiological-permeability objectives. This reflects the general controllability-distribution trade-off: when structure-property rationales are weakly correlated across tasks, improved controllability may require larger deviations from the learned generative prior.

##### Condition Controllability

Graph DiT is a strong conditional generation baseline, and RL-based methods such as GDPO and FREED are competitive because they directly optimize target-aligned rewards. Nevertheless, CoMole achieves the best performance across all targets, with an average MAE reduction of 27% compared to its SFT-only variant $\textnormal{CoMole}_{\text{w/o RL}}$, which itself ranks second overall in controllability. Compared with the best baseline in each setting, CoMole reduces MAE by 44.8% over FREED on gas-permeability properties and by 48.2% over Graph DiT on DFT properties. On the drug-related task set, CoMole reduces FreeSolv MAE by over 13% and achieves a BACE accuracy of 1.0.

### 4.3 RQ2: Ablation Studies and Model Analysis

##### Atom\-level RL

We replace the motif-aware graph representation in CoMole and $\textnormal{CoMole}_{\text{w/o RL}}$ with an atom-level representation on the gas-permeability task set while keeping all other settings unchanged. As shown in [Table 4](https://arxiv.org/html/2605.15354#S4.T4), atom-level RL fails under $w_{\mathrm{val}}=0.1$: rule-free validity drops from 0.95 to 0.07, fragment similarity drops from 0.94 to 0.18, and Avg. MAE increases from 0.43 to 1.57. Increasing the validity weight to $w_{\mathrm{val}}=0.5$ partially recovers validity but further worsens controllability, indicating that reward reweighting cannot resolve the bottleneck of atom-level RL. The advantage of motif-aware representations is also visible before RL. These results support that atom-level RL is fragile under terminal rewards over long edit sequences, where stable structure-property relationships cannot be reliably learned.

Table 4: Atom-level RL ablation on the gas-permeability dataset. $w_{\mathrm{val}}$ is the validity-reward weight. Atom-level RL remains unstable and poorly controllable even with a larger validity weight.

##### Generalization

We further evaluate whether CoMole can extend controllability to unseen property targets while keeping the generator frozen. For each of the two held-out DFT properties, Ei and EPS, we use a 0.8/0.2 train/test split, with the training split used to fit baselines and to learn new task embeddings as convex combinations of the trained DFT task embeddings, e.g., $e_{\mathrm{Ei}}=0.68\,e_{\mathrm{Eea}}+0.12\,e_{\mathrm{Egb}}+0.20\,e_{\mathrm{Egc}}$. As shown in [Table 5](https://arxiv.org/html/2605.15354#S4.T5), CoMole maintains high validity and achieves controllability competitive with baselines trained directly on these targets. These results suggest that CoMole captures structure-property knowledge that transfers across related targets, highlighting its potential for scalable and data-efficient extension to new properties.
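The convex-combination construction of a new task embedding can be sketched in a few lines of Python. The weights 0.68, 0.12, and 0.20 are the ones reported for Ei; the two-dimensional base embeddings are toy values, and the softmax parameterization of the weights is an assumption made here to keep them non-negative and summing to one, since the text does not state how the weights are optimized.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Map unconstrained logits to convex weights (non-negative, summing to 1)."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def compose_embedding(
    weights: Sequence[float],
    base_embeddings: Sequence[Sequence[float]],
) -> List[float]:
    """Return sum_k weights[k] * base_embeddings[k] for a convex weight vector."""
    if len(weights) != len(base_embeddings):
        raise ValueError("need one weight per base embedding")
    if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-6:
        raise ValueError("weights must be non-negative and sum to 1")
    dim = len(base_embeddings[0])
    return [
        sum(w * emb[k] for w, emb in zip(weights, base_embeddings))
        for k in range(dim)
    ]


# Toy 2-d stand-ins for the trained Eea, Egb, Egc task embeddings (hypothetical values).
e_eea, e_egb, e_egc = [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]
e_ei = compose_embedding([0.68, 0.12, 0.20], [e_eea, e_egb, e_egc])
```

Under this parameterization only the three logits behind `softmax` would be trainable for a new property, which matches the frozen-generator setting described above.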

Table 5: Generalization to unseen DFT targets (Ei, EPS). With frozen generator and task embeddings composed from existing tasks (Eea, Egb, Egc), CoMole achieves competitive performance against task-specific baselines.

## 5 Related Work

##### Molecular Optimization and Inverse Design

Traditional molecular optimization searches chemical space with genetic algorithms, Monte Carlo tree search, or Bayesian optimization, typically requiring repeated black-box oracle queries (Jensen, [2019](https://arxiv.org/html/2605.15354#bib.bib18); Shahriari et al., [2016](https://arxiv.org/html/2605.15354#bib.bib24); Gao et al., [2022b](https://arxiv.org/html/2605.15354#bib.bib25)). Recent graph generative models amortize this search by learning to sample molecules under target specifications, with diffusion and flow-based models such as GDSS, DiGress, Graph DiT, and DeFoG providing strong backbones for molecular graph generation (Jo et al., [2022](https://arxiv.org/html/2605.15354#bib.bib22); Vignac et al., [2023](https://arxiv.org/html/2605.15354#bib.bib1); Liu et al., [2024a](https://arxiv.org/html/2605.15354#bib.bib3); Qin et al., [2025](https://arxiv.org/html/2605.15354#bib.bib10)). In contrast, we target heterogeneous molecular inverse design and aim to bridge the controllability gap between existing methods and practical inverse-design needs.

##### Reinforcement Learning over Graph Generation

Reinforcement learning provides a natural framework for aligning molecular generators with black-box objectives. Early methods optimize construction policies, such as fragment-based generation in FREED (Yang et al., [2021](https://arxiv.org/html/2605.15354#bib.bib9)). Recent work such as GDPO studies policy-gradient optimization for graph diffusion and highlights the instability of directly optimizing atom-level reverse trajectories (Liu et al., [2024b](https://arxiv.org/html/2605.15354#bib.bib7)). Our method performs RL over motif-aware graph diffusion instead, using chemically meaningful actions to stabilize policy optimization while improving controllability.

## 6 Conclusion

In this work, we introduce CoMole, controllable molecular generative foundation models for heterogeneous inverse design\. CoMole turns pretrained graph\-diffusion priors into controllable generation by optimizing conditional reverse policies via RL over motif\-aware, chemically meaningful decisions\. We theoretically characterize the decision complexity of atom\-level RL and justify motif\-aware policy optimization\. Empirically, across three materials and drug\-discovery benchmarks covering both numerical and categorical conditions, CoMole ranks first in controllability on all nine targets, maintains validity above 0\.94 without rule\-based correction, and generalizes to unseen properties by updating only task embeddings\. These results suggest CoMole as a practical foundation for unified, scalable, and controllable molecular inverse design across heterogeneous property tasks\.

## References

- Optuna: a next\-generation hyperparameter optimization framework\.External Links:1907\.10902,[Link](https://arxiv.org/abs/1907.10902)Cited by:[§D\.1](https://arxiv.org/html/2605.15354#A4.SS1.p3.1)\.
- R\. Bommasani, D\. A\. Hudson, E\. Adeli, R\. Altman, S\. Arora, S\. von Arx, M\. S\. Bernstein, J\. Bohg, A\. Bosselut, E\. Brunskill, E\. Brynjolfsson, S\. Buch, D\. Card, R\. Castellon, N\. Chatterji, A\. Chen, K\. Creel, J\. Q\. Davis, D\. Demszky, C\. Donahue, M\. Doumbouya, E\. Durmus, S\. Ermon, J\. Etchemendy, K\. Ethayarajh, L\. Fei\-Fei, C\. Finn, T\. Gale, L\. Gillespie, K\. Goel, N\. Goodman, S\. Grossman, N\. Guha, T\. Hashimoto, P\. Henderson, J\. Hewitt, D\. E\. Ho, J\. Hong, K\. Hsu, J\. Huang, T\. Icard, S\. Jain, D\. Jurafsky, P\. Kalluri, S\. Karamcheti, G\. Keeling, F\. Khani, O\. Khattab, P\. W\. Koh, M\. Krass, R\. Krishna, R\. Kuditipudi, A\. Kumar, F\. Ladhak, M\. Lee, T\. Lee, J\. Leskovec, I\. Levent, X\. L\. Li, X\. Li, T\. Ma, A\. Malik, C\. D\. Manning, S\. Mirchandani, E\. Mitchell, Z\. Munyikwa, S\. Nair, A\. Narayan, D\. Narayanan, B\. Newman, A\. Nie, J\. C\. Niebles, H\. Nilforoshan, J\. Nyarko, G\. Ogut, L\. Orr, I\. Papadimitriou, J\. S\. Park, C\. Piech, E\. Portelance, C\. Potts, A\. Raghunathan, R\. Reich, H\. Ren, F\. Rong, Y\. Roohani, C\. Ruiz, J\. Ryan, C\. Ré, D\. Sadigh, S\. Sagawa, K\. Santhanam, A\. Shih, K\. Srinivasan, A\. Tamkin, R\. Taori, A\. W\. Thomas, F\. Tramèr, R\. E\. Wang, W\. Wang, B\. Wu, J\. Wu, Y\. Wu, S\. M\. Xie, M\. Yasunaga, J\. You, M\. Zaharia, M\. Zhang, T\. Zhang, X\. Zhang, Y\. Zhang, L\. Zheng, K\. Zhou, and P\. Liang \(2022\)On the opportunities and risks of foundation models\.External Links:2108\.07258,[Link](https://arxiv.org/abs/2108.07258)Cited by:[§1](https://arxiv.org/html/2605.15354#S1.p1.1)\.
- L\. Breiman \(2001\)Random forests\.Mach\. Learn\.45\(1\),pp\. 5–32\.External Links:ISSN 0885\-6125,[Link](https://doi.org/10.1023/A:1010933404324),[Document](https://dx.doi.org/10.1023/A%3A1010933404324)Cited by:[§D\.4\.4](https://arxiv.org/html/2605.15354#A4.SS4.SSS4.p2.1)\.
- N\. Brown, M\. Fiscato, M\. H\. Segler, and A\. C\. Vaucher \(2019\)GuacaMol: benchmarking models for de novo molecular design\.Journal of chemical information and modeling59\(3\),pp\. 1096–1108\.Cited by:[§4\.1](https://arxiv.org/html/2605.15354#S4.SS1.SSS0.Px3.p1.1)\.
- G\. Dulac\-Arnold, R\. Evans, H\. van Hasselt, P\. Sunehag, T\. Lillicrap, J\. Hunt, T\. Mann, T\. Weber, T\. Degris, and B\. Coppin \(2015\)Deep reinforcement learning in large discrete action spaces\.arXiv preprint arXiv:1512\.07679\.Cited by:[§1](https://arxiv.org/html/2605.15354#S1.p2.1)\.
- W\. Gao, T\. Fu, J\. Sun, and C\. W\. Coley \(2022a\)Sample efficiency matters: a benchmark for practical molecular optimization\.External Links:2206\.12411,[Link](https://arxiv.org/abs/2206.12411)Cited by:[§4\.1](https://arxiv.org/html/2605.15354#S4.SS1.SSS0.Px1.p1.3),[§4\.1](https://arxiv.org/html/2605.15354#S4.SS1.SSS0.Px2.p1.1)\.
- W\. Gao, T\. Fu, J\. Sun, and C\. Coley \(2022b\)Sample efficiency matters: a benchmark for practical molecular optimization\.Advances in neural information processing systems35,pp\. 21342–21357\.Cited by:[§5](https://arxiv.org/html/2605.15354#S5.SS0.SSS0.Px1.p1.1)\.
- H\. Huang, L\. Sun, B\. Du, and W\. Lv \(2023\)Conditional diffusion based on discrete graph structures for molecular graph generation\.External Links:2301\.00427,[Link](https://arxiv.org/abs/2301.00427)Cited by:[§1](https://arxiv.org/html/2605.15354#S1.p1.1)\.
- J\. H\. Jensen \(2019\)A graph\-based genetic algorithm and generative model/monte carlo tree search for the exploration of chemical space\.Chemical science10\(12\),pp\. 3567–3572\.Cited by:[§4\.1](https://arxiv.org/html/2605.15354#S4.SS1.SSS0.Px3.p1.1),[§5](https://arxiv.org/html/2605.15354#S5.SS0.SSS0.Px1.p1.1)\.
- W\. Jin, R\. Barzilay, and T\. Jaakkola \(2018\)Junction tree variational autoencoder for molecular graph generation\.InInternational conference on machine learning,pp\. 2323–2332\.Cited by:[§4\.1](https://arxiv.org/html/2605.15354#S4.SS1.SSS0.Px3.p1.1)\.
- W\. Jin, R\. Barzilay, and T\. Jaakkola \(2019\)Junction tree variational autoencoder for molecular graph generation\.External Links:1802\.04364,[Link](https://arxiv.org/abs/1802.04364)Cited by:[§1](https://arxiv.org/html/2605.15354#S1.p3.1)\.
- J\. Jo, S\. Lee, and S\. J\. Hwang \(2022\)Score\-based generative modeling of graphs via the system of stochastic differential equations\.InInternational conference on machine learning,pp\. 10362–10383\.Cited by:[§4\.1](https://arxiv.org/html/2605.15354#S4.SS1.SSS0.Px3.p1.1),[§5](https://arxiv.org/html/2605.15354#S5.SS0.SSS0.Px1.p1.1)\.
- X\. Kong, W\. Huang, Z\. Tan, and Y\. Liu \(2022\)Molecule generation by principal subgraph mining and assembling\.Advances in Neural Information Processing Systems35,pp\. 2550–2563\.Cited by:[§1](https://arxiv.org/html/2605.15354#S1.p3.1)\.
- S\. Lee, J\. Jo, and S\. J\. Hwang \(2023\)Exploring chemical space with score\-based out\-of\-distribution generation\.InInternational Conference on Machine Learning,pp\. 18872–18892\.Cited by:[§4\.1](https://arxiv.org/html/2605.15354#S4.SS1.SSS0.Px3.p1.1)\.
- S\. Levine \(2018\)Reinforcement learning and control as probabilistic inference: tutorial and review\.External Links:1805\.00909,[Link](https://arxiv.org/abs/1805.00909)Cited by:[§3\.2\.1](https://arxiv.org/html/2605.15354#S3.SS2.SSS1.p1.3)\.
- G\. Liu, J\. Chen, Y\. Zhu, M\. Sun, T\. Luo, N\. V\. Chawla, and M\. Jiang \(2025\)Graph diffusion transformers are in\-context molecular designers\.External Links:2510\.08744,[Link](https://arxiv.org/abs/2510.08744)Cited by:[§A\.1](https://arxiv.org/html/2605.15354#A1.SS1.SSS0.Px1.p1.2),[§C\.2\.1](https://arxiv.org/html/2605.15354#A3.SS2.SSS1.p1.1),[§1](https://arxiv.org/html/2605.15354#S1.p1.1),[§1](https://arxiv.org/html/2605.15354#S1.p4.1),[§2\.1](https://arxiv.org/html/2605.15354#S2.SS1.p1.12)\.
- G\. Liu, J\. Xu, T\. Luo, and M\. Jiang \(2024a\)Graph diffusion transformers for multi\-conditional molecular generation\.External Links:2401\.13858,[Link](https://arxiv.org/abs/2401.13858)Cited by:[§1](https://arxiv.org/html/2605.15354#S1.p1.1),[§4\.1](https://arxiv.org/html/2605.15354#S4.SS1.SSS0.Px3.p1.1),[§5](https://arxiv.org/html/2605.15354#S5.SS0.SSS0.Px1.p1.1)\.
- Y\. Liu, C\. Du, T\. Pang, C\. Li, M\. Lin, and W\. Chen \(2024b\)Graph diffusion policy optimization\.External Links:2402\.16302,[Link](https://arxiv.org/abs/2402.16302)Cited by:[§D\.4\.2](https://arxiv.org/html/2605.15354#A4.SS4.SSS2.p1.1),[§1](https://arxiv.org/html/2605.15354#S1.p2.1),[§4\.1](https://arxiv.org/html/2605.15354#S4.SS1.SSS0.Px3.p1.1),[§5](https://arxiv.org/html/2605.15354#S5.SS0.SSS0.Px2.p1.1)\.
- R\. Ma and T\. Luo \(2020\)PI1M: a benchmark database for polymer informatics\.Journal of Chemical Information and Modeling60\(10\),pp\. 4684–4690\.Cited by:[§C\.1\.2](https://arxiv.org/html/2605.15354#A3.SS1.SSS2.p1.3)\.
- K\. Preuer, P\. Renz, T\. Unterthiner, S\. Hochreiter, and G\. Klambauer \(2018\)Fréchet chemnet distance: a metric for generative models for molecules in drug discovery\.Journal of Chemical Information and Modeling58\(9\),pp\. 1736–1741\.Note:PMID: 30118593External Links:[Document](https://dx.doi.org/10.1021/acs.jcim.8b00234),[Link](https://doi.org/10.1021/acs.jcim.8b00234)Cited by:[§4\.1](https://arxiv.org/html/2605.15354#S4.SS1.SSS0.Px2.p1.1)\.
- Y\. Qin, M\. Madeira, D\. Thanou, and P\. Frossard \(2025\)DeFoG: discrete flow matching for graph generation\.External Links:2410\.04263,[Link](https://arxiv.org/abs/2410.04263)Cited by:[§1](https://arxiv.org/html/2605.15354#S1.p1.1),[§4\.1](https://arxiv.org/html/2605.15354#S4.SS1.SSS0.Px3.p1.1),[§5](https://arxiv.org/html/2605.15354#S5.SS0.SSS0.Px1.p1.1)\.
- A\. Z\. Ren, J\. Lidard, L\. L\. Ankile, A\. Simeonov, P\. Agrawal, A\. Majumdar, B\. Burchfiel, H\. Dai, and M\. Simchowitz \(2024\)Diffusion policy policy optimization\.External Links:2409\.00588,[Link](https://arxiv.org/abs/2409.00588)Cited by:[3rd item](https://arxiv.org/html/2605.15354#A4.I1.i3.p1.3)\.
- P\. Renz, D\. Van Rompaey, J\. K\. Wegner, S\. Hochreiter, and G\. Klambauer \(2019\)On failure modes in molecule generation and optimization\.Drug Discovery Today: Technologies32\-33,pp\. 55–63\.Note:Artificial IntelligenceExternal Links:ISSN 1740\-6749,[Document](https://dx.doi.org/https%3A//doi.org/10.1016/j.ddtec.2020.09.003),[Link](https://www.sciencedirect.com/science/article/pii/S1740674920300159)Cited by:[§D\.4\.5](https://arxiv.org/html/2605.15354#A4.SS4.SSS5.p2.1)\.
- B\. Shahriari, K\. Swersky, Z\. Wang, R\. P\. Adams, and N\. de Freitas \(2016\)Taking the human out of the loop: a review of bayesian optimization\.Proceedings of the IEEE104\(1\),pp\. 148–175\.External Links:[Document](https://dx.doi.org/10.1109/JPROC.2015.2494218)Cited by:[§5](https://arxiv.org/html/2605.15354#S5.SS0.SSS0.Px1.p1.1)\.
- A\. W\. Thornton, B\. D\. Freeman, and L\. M\. Robeson \(2012\)Polymer gas separation membrane database\.Note:[https://membrane\-australasia\.org/polymer\-gas\-separation\-membrane\-database/](https://membrane-australasia.org/polymer-gas-separation-membrane-database/)Accessed: 2026\-04\-06Cited by:[§C\.1\.2](https://arxiv.org/html/2605.15354#A3.SS1.SSS2.p1.3),[§4\.1](https://arxiv.org/html/2605.15354#S4.SS1.SSS0.Px1.p1.3)\.
- E\. Todorov \(2006\)Linearly\-solvable markov decision problems\.Advances in neural information processing systems19\.Cited by:[§3\.2\.1](https://arxiv.org/html/2605.15354#S3.SS2.SSS1.p1.3)\.
- A\. Tripp and J\. M\. Hernández\-Lobato \(2023\)Genetic algorithms are strong baselines for molecule generation\.External Links:2310\.09267,[Link](https://arxiv.org/abs/2310.09267)Cited by:[§D\.4\.5](https://arxiv.org/html/2605.15354#A4.SS4.SSS5.p2.1)\.
- C\. Vignac, I\. Krawczuk, A\. Siraudin, B\. Wang, V\. Cevher, and P\. Frossard \(2023\)DiGress: discrete denoising diffusion for graph generation\.External Links:2209\.14734,[Link](https://arxiv.org/abs/2209.14734)Cited by:[§1](https://arxiv.org/html/2605.15354#S1.p1.1),[§4\.1](https://arxiv.org/html/2605.15354#S4.SS1.SSS0.Px3.p1.1),[§5](https://arxiv.org/html/2605.15354#S5.SS0.SSS0.Px1.p1.1)\.
- Z\. Wu, B\. Ramsundar, E\. N\. Feinberg, J\. Gomes, C\. Geniesse, A\. S\. Pappu, K\. Leswing, and V\. Pande \(2018\)MoleculeNet: a benchmark for molecular machine learning\.Chemical science9\(2\),pp\. 513–530\.Cited by:[2nd item](https://arxiv.org/html/2605.15354#A3.I1.i2.p1.1),[§C\.1\.2](https://arxiv.org/html/2605.15354#A3.SS1.SSS2.p1.3),[§D\.3](https://arxiv.org/html/2605.15354#A4.SS3.p4.1),[§4\.1](https://arxiv.org/html/2605.15354#S4.SS1.SSS0.Px1.p1.3)\.
- Y\. Xie, C\. Shi, H\. Zhou, Y\. Yang, W\. Zhang, Y\. Yu, and L\. Li \(2021\)Mars: markov molecular sampling for multi\-objective drug discovery\.arXiv preprint arXiv:2103\.10432\.Cited by:[§4\.1](https://arxiv.org/html/2605.15354#S4.SS1.SSS0.Px3.p1.1)\.
- C\. Xu, Y\. Wang, and A\. Barati Farimani \(2023\)TransPolymer: a transformer\-based language model for polymer property predictions\.npj Computational Materials9\(1\),pp\. 64\.External Links:[Document](https://dx.doi.org/10.1038/s41524-023-01016-5)Cited by:[§C\.1\.2](https://arxiv.org/html/2605.15354#A3.SS1.SSS2.p1.3),[§4\.1](https://arxiv.org/html/2605.15354#S4.SS1.SSS0.Px1.p1.3)\.
- S\. Yang, D\. Hwang, S\. Lee, S\. Ryu, and S\. J\. Hwang \(2021\)Hit and lead discovery with explorative rl and fragment\-based molecule generation\.Advances in Neural Information Processing Systems34,pp\. 7924–7936\.Cited by:[§4\.1](https://arxiv.org/html/2605.15354#S4.SS1.SSS0.Px3.p1.1),[§5](https://arxiv.org/html/2605.15354#S5.SS0.SSS0.Px2.p1.1)\.
- X\. Zhang, Y\. Zeng, J\. Zhang, and H\. Li \(2023\)Toward building general foundation models for language, vision, and vision\-language understanding tasks\.External Links:2301\.05065,[Link](https://arxiv.org/abs/2301.05065)Cited by:[§1](https://arxiv.org/html/2605.15354#S1.p1.1)\.
- Y\. Zhu, G\. Liu, E\. Inae, T\. Luo, and M\. Jiang \(2026\)Learning repetition\-invariant representations for polymer informatics\.External Links:2505\.10726,[Link](https://arxiv.org/abs/2505.10726)Cited by:[§D\.4\.4](https://arxiv.org/html/2605.15354#A4.SS4.SSS4.p2.1),[§4\.1](https://arxiv.org/html/2605.15354#S4.SS1.SSS0.Px2.p1.1)\.

## Appendix A Details on Model

### A.1 Model Architecture

##### Motif\-aware graph\-level tokens

We instantiate the denoiser as a motif-aware DiT-style transformer over padded motif-graph states $z_t=(X_t,E_t,P_t,m)$. Following the graph-level token construction in DemoDiff [Liu et al., [2025](https://arxiv.org/html/2605.15354#bib.bib6)], each transformer token corresponds to one motif slot and contains both its motif identity and its row-wise relation profile. For an active motif slot $i$, the initial token is

$$u_i = g_X(X_{t,i}) \;\|\; g_E(E_{t,i:}) \;\|\; g_P(P_{t,i:}),$$

where $g_X$ is a motif-type embedding, and $g_E$, $g_P$ are linear projections of the vectorized one-hot rows of inter-motif bond labels and directed attachment-position labels. The fixed padded slot order is used when vectorizing $E_{t,i:}$ and $P_{t,i:}$, so the $j$-th block always corresponds to the relation from motif slot $i$ to motif slot $j$. Thus, pair-specific topology is injected before self-attention rather than pooled into an unordered summary. Null labels are used for non-bonded pairs and missing attachment positions, and entries involving padded slots are masked out.
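A minimal pure-Python sketch of this token construction follows. The helper names, the list-of-lists weight layout, and the choice to zero out blocks for padded slots are illustrative assumptions; the actual embedding sizes and projection parameters are not specified in this section.

```python
from typing import List, Sequence


def one_hot(label: int, num_classes: int) -> List[float]:
    vec = [0.0] * num_classes
    vec[label] = 1.0
    return vec


def vectorize_row(
    labels: Sequence[int], num_classes: int, mask: Sequence[int]
) -> List[float]:
    """Flatten one relation row in the fixed padded slot order.

    Block j holds the one-hot label for the pair (i, j); blocks for
    padded slots (mask[j] == 0) are zeroed out.
    """
    row: List[float] = []
    for j, label in enumerate(labels):
        if mask[j]:
            row.extend(one_hot(label, num_classes))
        else:
            row.extend([0.0] * num_classes)
    return row


def linear(weight: Sequence[Sequence[float]], x: Sequence[float]) -> List[float]:
    """Plain matrix-vector product standing in for the projections g_E and g_P."""
    return [sum(w * v for w, v in zip(w_row, x)) for w_row in weight]


def build_token(
    x_embedding: Sequence[float],
    e_row: Sequence[float],
    p_row: Sequence[float],
    w_e: Sequence[Sequence[float]],
    w_p: Sequence[Sequence[float]],
) -> List[float]:
    """Concatenate g_X(X_i), g_E(E_i:), and g_P(P_i:) into one slot token."""
    return list(x_embedding) + linear(w_e, e_row) + linear(w_p, p_row)
```

Because the row is flattened in slot order before projection, swapping two neighbours changes the token, which is the sense in which pair-specific topology is preserved rather than pooled.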

##### Transformer backbone and conditioning

The backbone is a stack of motif\-slot self\-attention blocks with DiT\-style adaptive LayerNorm\. The condition consists of a task identity and a target value or label: the task identity is represented by a learned embedding, the target is encoded by a small MLP, and the two are fused into a condition vector\. This condition vector is added to the timestep embedding and used to modulate each transformer block\. For unconditional pretraining, the condition vector is omitted\. Since bond and attachment\-position information is already included in the graph\-level motif tokens, we do not add a separate edge bias to self\-attention\.

##### Output heads

Let $h_i$ be the final hidden representation of motif slot $i$, split as

$$h_i = h_i^{X} \,\|\, h_i^{E} \,\|\, h_i^{P}.$$

The three parts predict motif identities, bond rows, and attachment-position rows through residual row-wise heads:

$$\ell_i^{X} = W_X h_i^{X} + \mathrm{onehot}(X_{t,i}),$$

$$\widetilde{L}_i^{E} = \operatorname{reshape}(W_E h_i^{E}) + \mathrm{onehot}(E_{t,i:}), \qquad \ell_{ij}^{E} = \frac{1}{2}\left(\widetilde{L}_i^{E}[j] + \widetilde{L}_j^{E}[i]\right),$$

and

$$L_i^{P} = \operatorname{reshape}(W_P h_i^{P}) + \mathrm{onehot}(P_{t,i:}), \qquad \ell_{ij}^{P} = L_i^{P}[j].$$

The bond logits are symmetrized because $E_{ij}=E_{ji}$, while the attachment-position logits remain directional because $P_{ij}$ and $P_{ji}$ encode different attachment positions. All logits involving padded slots are masked out during training, sampling, and PPO log-probability computation. The resulting logits parameterize the endpoint-state predictions $(\hat{X}_0,\hat{E}_0,\hat{P}_0)$, which are converted into reverse-step probabilities through the $x_0$-prediction posterior.
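The symmetrization of the bond logits can be made concrete with a short sketch. It assumes the row-wise logits are stored as a nested list `row_logits[i][j]` holding one list of class logits per ordered pair; this layout and the function name are hypothetical.

```python
from typing import List

Logits = List[List[List[float]]]  # row_logits[i][j] -> class logits for pair (i, j)


def symmetrize_bond_logits(row_logits: Logits) -> Logits:
    """Average the (i, j) prediction from row i with the (j, i) prediction from row j."""
    n = len(row_logits)
    return [
        [
            [0.5 * (a + b) for a, b in zip(row_logits[i][j], row_logits[j][i])]
            for j in range(n)
        ]
        for i in range(n)
    ]


# Two slots, two bond classes: rows 0 and 1 disagree about the pair (0, 1).
example = [
    [[0.0, 0.0], [1.0, 3.0]],
    [[3.0, 1.0], [0.0, 0.0]],
]
sym = symmetrize_bond_logits(example)
```

The attachment-position logits would skip this step and be read directly as `row_logits[i][j]`, since they are directional.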

### A.2 Mask handling

All main-text distributions are conditional on a fixed node mask $m$. If $m$ is not externally specified at sampling time, the full generative model can be written as

$$p_\theta(z_{0:T}, m \mid c) = \nu(m)\, p(z_T \mid m) \prod_{t=1}^{T} \widetilde{p}_\theta(z_{t-1} \mid z_t, t, c, m), \tag{18}$$

where $\nu$ is a prior over masks (equivalently, over graph sizes). The main text suppresses this conditioning for readability.
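As an illustration of the first factor in Eq. (18), the sketch below draws a mask from a categorical prior over graph sizes. The assumptions here are that active slots occupy the leading positions of the padded layout and that the prior is given as a list of probabilities; neither detail is stated in the text, and the function name is hypothetical.

```python
import random
from typing import List, Sequence


def sample_mask(
    size_probs: Sequence[float], n_max: int, rng: random.Random
) -> List[int]:
    """Sample a node mask m ~ nu, where size_probs[k] is the prior on k + 1 active slots."""
    if len(size_probs) != n_max:
        raise ValueError("need one prior probability per possible graph size")
    sizes = list(range(1, n_max + 1))
    n_active = rng.choices(sizes, weights=size_probs, k=1)[0]
    # Assumed layout: active slots first, padded slots last.
    return [1] * n_active + [0] * (n_max - n_active)


mask = sample_mask([0.0, 1.0, 0.0], n_max=3, rng=random.Random(0))
```

Once the mask is drawn, every later reverse step in Eq. (18) is conditioned on it and it stays fixed along the trajectory.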

### A.3 Masked denoising loss

Let $N_{\max}$ denote the fixed maximum number of motif nodes after padding. For a mask $m$, define the active node, undirected pair, and directed pair index sets

$$\Omega_X(m) = \{i \in [N_{\max}] : m_i = 1\},$$

$$\Omega_E(m) = \{(i,j) : 1 \leq i < j \leq N_{\max},\; m_i m_j = 1\},$$

and

$$\Omega_P(m) = \{(i,j) : i \neq j,\; m_i m_j = 1\}.$$

Here $\Omega_E(m)$ indexes undirected inter-motif bond variables, while $\Omega_P(m)$ indexes directional attachment-position variables. For directed pairs without an attachment, $P_{ij}$ is assigned a null attachment-position label.

Then

$$\mathrm{CE}_X = -\frac{1}{|\Omega_X(m)|} \sum_{i \in \Omega_X(m)} \sum_{a=1}^{d_X} X_{0,ia} \log \hat{X}_{0,ia}, \tag{19}$$

$$\mathrm{CE}_E = -\frac{1}{|\Omega_E(m)|} \sum_{(i,j) \in \Omega_E(m)} \sum_{b=1}^{d_E} E_{0,ijb} \log \hat{E}_{0,ijb}, \tag{20}$$

$$\mathrm{CE}_P = -\frac{1}{|\Omega_P(m)|} \sum_{(i,j) \in \Omega_P(m)} \sum_{p=1}^{d_P} P_{0,ijp} \log \hat{P}_{0,ijp}. \tag{21}$$

If $\Omega_E(m)$ or $\Omega_P(m)$ is empty, the corresponding term is defined as $0$.
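The index sets and the masked cross-entropy terms in Eqs. (19)-(21) can be sketched as follows. Indices are 0-based here (the equations are 1-based), targets and predictions are stored in dictionaries keyed by node or pair index, and the small `eps` inside the logarithm is a numerical-stability assumption that is not part of the equations.

```python
import math
from typing import Dict, Hashable, List, Sequence, Tuple


def node_index_set(m: Sequence[int]) -> List[int]:
    return [i for i, mi in enumerate(m) if mi == 1]


def undirected_pair_index_set(m: Sequence[int]) -> List[Tuple[int, int]]:
    n = len(m)
    return [(i, j) for i in range(n) for j in range(i + 1, n) if m[i] * m[j] == 1]


def directed_pair_index_set(m: Sequence[int]) -> List[Tuple[int, int]]:
    n = len(m)
    return [(i, j) for i in range(n) for j in range(n) if i != j and m[i] * m[j] == 1]


def masked_cross_entropy(
    target: Dict[Hashable, Sequence[float]],
    pred: Dict[Hashable, Sequence[float]],
    index_set: Sequence[Hashable],
    eps: float = 1e-12,
) -> float:
    """Average cross-entropy over active indices; defined as 0 for an empty index set."""
    if not index_set:
        return 0.0
    total = 0.0
    for idx in index_set:
        total -= sum(t * math.log(p + eps) for t, p in zip(target[idx], pred[idx]))
    return total / len(index_set)
```

The same `masked_cross_entropy` function covers all three terms; only the index set passed in changes between nodes, undirected pairs, and directed pairs.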

### A.4 Factorized reverse kernel and one-step log-probability

Conditioned on $z_t$, $t$, $c$, and the fixed mask $m$, we use a factorized categorical reverse kernel

$$\begin{aligned}
\widetilde{p}_\theta(z_{t-1} \mid z_t, t, c, m) &= \prod_{i \in \Omega_X(m)} p_{\theta,t}^{X}\!\left(X_{t-1,i} \mid z_t, c, m\right) \prod_{(i,j) \in \Omega_E(m)} p_{\theta,t}^{E}\!\left(E_{t-1,ij} \mid z_t, c, m\right) \\
&\qquad \times \prod_{(i,j) \in \Omega_P(m)} p_{\theta,t}^{P}\!\left(P_{t-1,ij} \mid z_t, c, m\right).
\end{aligned} \tag{22}$$

The $P$-factor is defined over all active directed motif pairs. For pairs without an attachment, the corresponding categorical outcome is the null class $0$.

Accordingly, for $a_h = z_{t-1}$ with $t = T - h$, the one-step log-probability used in PPO is

$$\begin{aligned}
\log \pi_\theta(a_h \mid s_h) &= \sum_{i \in \Omega_X(m)} \log p_{\theta,t}^{X}\!\left(X_{t-1,i} \mid z_t, c, m\right) + \sum_{(i,j) \in \Omega_E(m)} \log p_{\theta,t}^{E}\!\left(E_{t-1,ij} \mid z_t, c, m\right) \\
&\qquad + \sum_{(i,j) \in \Omega_P(m)} \log p_{\theta,t}^{P}\!\left(P_{t-1,ij} \mid z_t, c, m\right).
\end{aligned} \tag{23}$$

If the network predicts endpoint-state distributions $(\hat{X}_0,\hat{E}_0,\hat{P}_0)$, the factors above denote the corresponding $z_{t-1}$-probabilities induced by the reverse posterior parameterization.
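Eq. (23) is a sum of categorical log-probabilities over the three index sets, which the following sketch spells out. It assumes the per-variable reverse-step probabilities have already been computed and are stored in dictionaries keyed by node or pair index (0-based); these containers and the function name are hypothetical.

```python
import math
from typing import Dict, Sequence, Tuple

Pair = Tuple[int, int]


def one_step_log_prob(
    m: Sequence[int],
    px: Dict[int, Sequence[float]], x_next: Dict[int, int],
    pe: Dict[Pair, Sequence[float]], e_next: Dict[Pair, int],
    pp: Dict[Pair, Sequence[float]], p_next: Dict[Pair, int],
) -> float:
    """Sum log-probabilities of the sampled z_{t-1} over active nodes and pairs."""
    n = len(m)
    nodes = [i for i in range(n) if m[i] == 1]
    undirected = [(i, j) for i in nodes for j in nodes if i < j]
    directed = [(i, j) for i in nodes for j in nodes if i != j]

    log_prob = 0.0
    for i in nodes:
        log_prob += math.log(px[i][x_next[i]])
    for pair in undirected:
        log_prob += math.log(pe[pair][e_next[pair]])
    for pair in directed:
        # Class 0 is the null attachment label for pairs without an attachment.
        log_prob += math.log(pp[pair][p_next[pair]])
    return log_prob


lp = one_step_log_prob(
    m=[1, 1],
    px={0: [0.5, 0.5], 1: [0.25, 0.75]}, x_next={0: 0, 1: 1},
    pe={(0, 1): [0.5, 0.5]}, e_next={(0, 1): 1},
    pp={(0, 1): [1.0, 0.0], (1, 0): [0.5, 0.5]}, p_next={(0, 1): 0, (1, 0): 1},
)
```

Padded slots never enter the sum because all three index sets are built from active nodes only.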

### A.5 PPO optimization details

##### Design rationale\.

The KL-regularized objective in [Eq. 7](https://arxiv.org/html/2605.15354#S3.E7) characterizes the desired policy improvement at the population level. In translating this objective into a practical PPO implementation, we use a standard surrogate decomposition: the critic is trained only against the terminal task reward, while KL regularization toward the frozen SFT reference policy is imposed as an explicit actor-side penalty. Specifically, the Monte Carlo return used for value learning is

$$G_h = R(z_0; c), \qquad h = 0, \ldots, T-1.$$

This keeps the value target tied to the fixed terminal reward function rather than to a policy-dependent KL cost that would change as the actor is updated. The KL coefficient $c_{\mathrm{kl}}$ in the actor loss therefore plays the empirical role of the regularization strength in the population objective, although it is not in one-to-one correspondence with $\beta$ because PPO uses clipped importance ratios, finite on-policy batches, and normalized advantages. Thus, the implemented PPO update is best interpreted as a tractable regularized policy-improvement surrogate for [Eq. 7](https://arxiv.org/html/2605.15354#S3.E7), prioritizing stable value estimation and computational feasibility over exact dynamic-programming correspondence.

We initialize RL from the conditional diffusion model after SFT and denote the frozen reference policy by

$$\pi_{\mathrm{ref}} := \pi_{\theta_{\mathrm{ref}}}. \tag{24}$$

Let $\theta_{\mathrm{old}}$ denote the behavior-policy parameters used to collect the current on-policy batch.

With a value estimator $V_\psi(s_h)$, the advantage is

$$A_h = G_h - V_\psi(s_h). \tag{25}$$

The PPO importance ratio is

$$\rho_h = \frac{\pi_\theta(a_h \mid s_h)}{\pi_{\theta_{\mathrm{old}}}(a_h \mid s_h)}. \tag{26}$$

The clipped actor loss, corresponding to the negative of the clipped surrogate in [Eq. 6](https://arxiv.org/html/2605.15354#S3.E6), is

$$\mathcal{L}_{\mathrm{clip}} = -\frac{1}{T} \sum_{h=0}^{T-1} \mathbb{E}\!\left[\min\!\Bigl(\rho_h A_h,\; \operatorname{clip}(\rho_h, 1-\epsilon, 1+\epsilon) A_h\Bigr)\right]. \tag{27}$$
The value loss, entropy bonus, and KL regularization toward the frozen reference policy are

ℒvalue=1T​∑h=0T−1𝔼​\[\(Vψ​\(sh\)−Gh\)2\],\\mathcal\{L\}\_\{\\mathrm\{value\}\}=\\frac\{1\}\{T\}\\sum\_\{h=0\}^\{T\-1\}\\mathbb\{E\}\\\!\\left\[\\bigl\(V\_\{\\psi\}\(s\_\{h\}\)\-G\_\{h\}\\bigr\)^\{2\}\\right\],\(28\)ℒent=1T∑h=0T−1𝔼\[ℋ\(πθ\(⋅∣sh\)\)\],\\mathcal\{L\}\_\{\\mathrm\{ent\}\}=\\frac\{1\}\{T\}\\sum\_\{h=0\}^\{T\-1\}\\mathbb\{E\}\\\!\\left\[\\mathcal\{H\}\\bigl\(\\pi\_\{\\theta\}\(\\cdot\\mid s\_\{h\}\)\\bigr\)\\right\],\(29\)ℒKL=1T∑h=0T−1𝔼\[KL\(πθ\(⋅∣sh\)∥πref\(⋅∣sh\)\)\]\.\\mathcal\{L\}\_\{\\mathrm\{KL\}\}=\\frac\{1\}\{T\}\\sum\_\{h=0\}^\{T\-1\}\\mathbb\{E\}\\\!\\left\[\\mathrm\{KL\}\\\!\\left\(\\pi\_\{\\theta\}\(\\cdot\\mid s\_\{h\}\)\\,\\\|\\,\\pi\_\{\\mathrm\{ref\}\}\(\\cdot\\mid s\_\{h\}\)\\right\)\\right\]\.\(30\)
The total PPO loss minimized in practice is

ℒPPO=ℒclip\+cv​ℒvalue−ce​ℒent\+ckl​ℒKL,\\mathcal\{L\}\_\{\\mathrm\{PPO\}\}=\\mathcal\{L\}\_\{\\mathrm\{clip\}\}\+c\_\{v\}\\,\\mathcal\{L\}\_\{\\mathrm\{value\}\}\-c\_\{e\}\\,\\mathcal\{L\}\_\{\\mathrm\{ent\}\}\+c\_\{\\mathrm\{kl\}\}\\,\\mathcal\{L\}\_\{\\mathrm\{KL\}\},\(31\)whereϵ\>0\\epsilon\>0is the clipping radius andcv,ce,ckl≥0c\_\{v\},c\_\{e\},c\_\{\\mathrm\{kl\}\}\\geq 0are loss weights\. All expectations are taken overc∼μRLc\\sim\\mu\_\{\\mathrm\{RL\}\}and trajectoriesτ∼pπθold\(⋅∣c\)\\tau\\sim p\_\{\\pi\_\{\\theta\_\{\\mathrm\{old\}\}\}\}\(\\cdot\\mid c\)\. In practice, advantages are normalized within each PPO batch\.
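The loss terms of this subsection can be assembled as in the following minimal NumPy sketch. It is an illustration under stated assumptions, not the released implementation: the function name `ppo_loss` and its array inputs (per-sample log-probabilities, returns, value predictions, entropies, and KL values toward the reference policy) are hypothetical, and the expectations are approximated by batch means over sampled (state, action) pairs. The default coefficients are the ones reported in Section D.1.

```python
import numpy as np

def ppo_loss(logp_new, logp_old, returns, values, entropy, kl_ref,
             eps=0.2, c_v=0.5, c_e=0.001, c_kl=0.01):
    """Scalar PPO loss over a batch of (state, action) samples.

    All inputs are 1-D arrays of equal length (hypothetical layout).
    """
    adv = returns - values                               # advantage A_h = G_h - V(s_h)
    adv = (adv - adv.mean()) / (adv.std() + 1e-8)        # per-batch normalization
    ratio = np.exp(logp_new - logp_old)                  # importance ratio rho_h
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    l_clip = -np.mean(np.minimum(ratio * adv, clipped * adv))  # clipped actor loss
    l_value = np.mean((values - returns) ** 2)           # value regression loss
    l_ent = np.mean(entropy)                             # entropy bonus
    l_kl = np.mean(kl_ref)                               # KL to the frozen reference
    return l_clip + c_v * l_value - c_e * l_ent + c_kl * l_kl
```

When the new and old policies coincide, the ratio is 1 and the normalized advantages average to zero, so only the value, entropy, and KL terms contribute.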

## Appendix B Details on Theory

Throughout this appendix, we condition on a fixed target condition $c$ and a fixed mask $m$, with $m$ suppressed in the main notation. We denote by $\pi_{\mathrm{ref}}$ the frozen reverse policy induced by the SFT checkpoint, which serves as the KL reference policy during PPO. All state, action, and trajectory spaces are finite because $N_{\max}$, $d_{X}$, $d_{E}$, $d_{P}$, and $T$ are finite. If the mask is sampled, the same derivations apply after treating $m$ as part of the state. All policies are assumed to be absolutely continuous with respect to the reference policy at each state, i.e.,

$$\pi(\cdot\mid s)\ll\pi_{\mathrm{ref}}(\cdot\mid s),$$

so that the KL terms are well-defined.

### B.1 A Gibbs variational identity used in the Bellman proof

###### Lemma B.1 (Gibbs variational identity).

Let $\mathcal{A}$ be a finite action set, let $\pi_{\mathrm{ref}}$ be a probability distribution on $\mathcal{A}$, let $Q:\mathcal{A}\to\mathbb{R}$, and let $\beta>0$. Then

$$\sup_{\pi\ll\pi_{\mathrm{ref}}}\left\{\sum_{a\in\mathcal{A}}\pi(a)Q(a)-\beta\,\mathrm{KL}(\pi\,\|\,\pi_{\mathrm{ref}})\right\}=\beta\log\sum_{a\in\mathcal{A}}\pi_{\mathrm{ref}}(a)\exp(Q(a)/\beta).\tag{32}$$

The unique maximizer is

$$\pi^{\star}(a)=\frac{\pi_{\mathrm{ref}}(a)\exp(Q(a)/\beta)}{\sum_{a'\in\mathcal{A}}\pi_{\mathrm{ref}}(a')\exp(Q(a')/\beta)}.\tag{33}$$

###### Proof\.

Define

$$Z:=\sum_{a\in\mathcal{A}}\pi_{\mathrm{ref}}(a)\exp(Q(a)/\beta)$$

and

$$\pi^{\star}(a)=\frac{\pi_{\mathrm{ref}}(a)\exp(Q(a)/\beta)}{Z}.$$

For any $\pi\ll\pi_{\mathrm{ref}}$,

$$
\begin{aligned}
\mathrm{KL}(\pi\,\|\,\pi^{\star})
&=\sum_{a}\pi(a)\log\frac{\pi(a)}{\pi^{\star}(a)}\\
&=\sum_{a}\pi(a)\log\frac{\pi(a)Z}{\pi_{\mathrm{ref}}(a)\exp(Q(a)/\beta)}\\
&=\log Z-\frac{1}{\beta}\sum_{a}\pi(a)Q(a)+\mathrm{KL}(\pi\,\|\,\pi_{\mathrm{ref}}).
\end{aligned}
$$

Rearranging gives

$$\sum_{a}\pi(a)Q(a)-\beta\,\mathrm{KL}(\pi\,\|\,\pi_{\mathrm{ref}})=\beta\log Z-\beta\,\mathrm{KL}(\pi\,\|\,\pi^{\star}).$$

The right-hand side is maximized uniquely when $\pi=\pi^{\star}$, which proves the claim. ∎
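The identity can be checked numerically on a small action set. The sketch below is only a sanity check of Lemma B.1 with hypothetical helper names (`kl`, `gibbs_optimum`, `objective`); it evaluates the regularized objective at the Gibbs policy and compares it with $\beta\log Z$.

```python
import numpy as np

def kl(p, q):
    """KL(p || q) for discrete distributions, skipping zero-mass entries of p."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def gibbs_optimum(pi_ref, Q, beta):
    """Return the Gibbs policy pi* and the optimal value beta * log Z."""
    shift = Q.max()                                # subtract max for numerical stability
    w = pi_ref * np.exp((Q - shift) / beta)
    pi_star = w / w.sum()
    value = shift + beta * np.log(w.sum())         # equals beta * log Z
    return pi_star, float(value)

def objective(pi, pi_ref, Q, beta):
    """Expected Q minus beta times KL to the reference."""
    return float(pi @ Q) - beta * kl(pi, pi_ref)
```

Evaluating `objective` at `pi_star` reproduces `value` up to floating-point error, and any other distribution on the same support gives a smaller or equal objective.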

### B.2 Proof of Proposition [3.1](https://arxiv.org/html/2605.15354#S3.Thmtheorem1)

###### Proof.

We prove the result by backward induction. At the terminal state $s_{T}=(z_{0},0,c)$, the value is

$$V_{T}^{\star}(s_{T})=R(z_{0};c).$$

Assume $V_{h+1}^{\star}$ gives the optimal regularized value from step $h+1$ onward. At state $s=s_{h}$, if the policy chooses a distribution $\pi_{h}(\cdot\mid s)$, then the one-step regularized objective is

$$\sum_{a\in\mathcal{A}(s)}\pi_{h}(a\mid s)Q_{h}^{\star}(s,a)-\beta\,\mathrm{KL}\!\left(\pi_{h}(\cdot\mid s)\,\|\,\pi_{\mathrm{ref}}(\cdot\mid s)\right),$$

where, because the MDP state update is deterministic,

$$Q_{h}^{\star}(s,a)=V_{h+1}^{\star}(s'),\qquad s'=(a,T-h-1,c).$$

Applying Lemma [B.1](https://arxiv.org/html/2605.15354#A2.Thmtheorem1) with $Q(a)=Q_{h}^{\star}(s,a)$ yields

$$V_{h}^{\star}(s)=\beta\log\sum_{a\in\mathcal{A}(s)}\pi_{\mathrm{ref}}(a\mid s)\exp(Q_{h}^{\star}(s,a)/\beta),$$

and the unique maximizing policy is

$$\pi_{h}^{\star}(a\mid s)=\pi_{\mathrm{ref}}(a\mid s)\exp\!\left(\frac{Q_{h}^{\star}(s,a)-V_{h}^{\star}(s)}{\beta}\right).$$

This proves the Bellman recursion and the policy form. Taking expectation over the initial noise $z_{T}\sim p(\cdot)$ gives the optimal value of $\mathcal{J}_{\beta}(\pi;c)$. ∎

### B.3 Proof of Corollary [3.2](https://arxiv.org/html/2605.15354#S3.Thmtheorem2)

###### Proof.

By [Eq. 10](https://arxiv.org/html/2605.15354#S3.E10),

$$\pi_{h}^{\star}(\mathcal{G}\mid s)=\frac{\sum_{a\in\mathcal{G}}\pi_{\mathrm{ref}}(a\mid s)\exp(Q_{h}^{\star}(s,a)/\beta)}{\sum_{a\in\mathcal{A}(s)}\pi_{\mathrm{ref}}(a\mid s)\exp(Q_{h}^{\star}(s,a)/\beta)}.$$

Let $p_{G}=\pi_{\mathrm{ref}}(\mathcal{G}\mid s)$. Using the assumptions

$$Q_{h}^{\star}(s,a)\geq b+\Delta\quad(a\in\mathcal{G}),\qquad Q_{h}^{\star}(s,a)\leq b\quad(a\notin\mathcal{G}),$$

we obtain

$$\sum_{a\in\mathcal{G}}\pi_{\mathrm{ref}}(a\mid s)\exp(Q_{h}^{\star}(s,a)/\beta)\geq e^{(b+\Delta)/\beta}p_{G},$$

and

$$\sum_{a\notin\mathcal{G}}\pi_{\mathrm{ref}}(a\mid s)\exp(Q_{h}^{\star}(s,a)/\beta)\leq e^{b/\beta}(1-p_{G}).$$

Therefore

$$\pi_{h}^{\star}(\mathcal{G}\mid s)\geq\frac{e^{(b+\Delta)/\beta}p_{G}}{e^{(b+\Delta)/\beta}p_{G}+e^{b/\beta}(1-p_{G})}=\frac{e^{\Delta/\beta}p_{G}}{e^{\Delta/\beta}p_{G}+1-p_{G}}.$$

∎
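To get a feel for the bound, the following short sketch evaluates its right-hand side for a given reference mass $p_{G}$, gap $\Delta$, and temperature $\beta$. The function name `good_set_lower_bound` is hypothetical, and the expression is rewritten as $p_{G}/(p_{G}+(1-p_{G})e^{-\Delta/\beta})$, which is algebraically identical but avoids overflow for large $\Delta/\beta$.

```python
import math

def good_set_lower_bound(p_good, delta, beta):
    """Lower bound on the optimal policy mass of the good set G.

    p_good: reference-policy mass on G, delta: value gap (>= 0), beta: KL weight (> 0).
    """
    damp = math.exp(-delta / beta)            # e^{-Delta/beta}, at most 1 for delta >= 0
    return p_good / (p_good + (1.0 - p_good) * damp)
```

With `delta = 0` the bound reduces to `p_good`, and it increases toward 1 as `delta / beta` grows, which matches the reading that a smaller $\beta$ lets the optimal policy concentrate on high-value actions.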

### B.4 Proof of Proposition [3.3](https://arxiv.org/html/2605.15354#S3.Thmtheorem3)

###### Proof.

For compactness, omit the representation superscript $r$ and the state $s$. Let

$$\pi(a)=\prod_{j=1}^{M}\pi_{j}(a_{j}\mid a_{<j}),\qquad\pi_{\mathrm{ref}}(a)=\prod_{j=1}^{M}\pi_{\mathrm{ref},j}(a_{j}\mid a_{<j}).$$

Then

$$\log\frac{\pi(a)}{\pi_{\mathrm{ref}}(a)}=\sum_{j=1}^{M}\log\frac{\pi_{j}(a_{j}\mid a_{<j})}{\pi_{\mathrm{ref},j}(a_{j}\mid a_{<j})}.$$

Taking expectation under $a\sim\pi$,

$$
\begin{aligned}
\mathrm{KL}(\pi\,\|\,\pi_{\mathrm{ref}})
&=\mathbb{E}_{a\sim\pi}\left[\sum_{j=1}^{M}\log\frac{\pi_{j}(a_{j}\mid a_{<j})}{\pi_{\mathrm{ref},j}(a_{j}\mid a_{<j})}\right]\\
&=\sum_{j=1}^{M}\mathbb{E}_{a_{<j}\sim\pi}\left[\sum_{a_{j}}\pi_{j}(a_{j}\mid a_{<j})\log\frac{\pi_{j}(a_{j}\mid a_{<j})}{\pi_{\mathrm{ref},j}(a_{j}\mid a_{<j})}\right]\\
&=\sum_{j=1}^{M}\mathbb{E}_{a_{<j}\sim\pi}\left[\mathrm{KL}\!\left(\pi_{j}(\cdot\mid a_{<j})\,\|\,\pi_{\mathrm{ref},j}(\cdot\mid a_{<j})\right)\right].
\end{aligned}
$$

Restoring $s$ and $r$ gives [Eq. 14](https://arxiv.org/html/2605.15354#S3.E14). ∎
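The chain rule can be verified numerically for a two-decision action ($M=2$). The sketch below uses hypothetical helper names and compares the KL divergence of the joint distributions with the sum of expected per-decision KL terms.

```python
import numpy as np

def kl(p, q):
    """KL(p || q) for discrete distributions given as arrays."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def joint_kl(p1, p2_given_1, q1, q2_given_1):
    """KL between the two joint distributions over (a_1, a_2)."""
    p_joint = p1[:, None] * p2_given_1     # p(a_1) * p(a_2 | a_1)
    q_joint = q1[:, None] * q2_given_1
    return kl(p_joint.ravel(), q_joint.ravel())

def chain_rule_kl(p1, p2_given_1, q1, q2_given_1):
    """First-decision KL plus the expected second-decision KL under p."""
    step1 = kl(p1, q1)
    step2 = sum(p1[i] * kl(p2_given_1[i], q2_given_1[i]) for i in range(len(p1)))
    return step1 + step2
```

Both functions return the same number up to floating-point error, which is the $M=2$ instance of the decomposition proved above.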

### B.5 Proof of Corollary [3.4](https://arxiv.org/html/2605.15354#S3.Thmtheorem4)

###### Proof.

By Proposition [3.3](https://arxiv.org/html/2605.15354#S3.Thmtheorem3),

$$\mathrm{KL}\!\left(\pi^{(r)}(\cdot\mid s)\,\|\,\pi^{(r)}_{\mathrm{ref}}(\cdot\mid s)\right)=\sum_{j=1}^{M_{r}(s)}D_{j}(s),$$

where

$$D_{j}(s):=\mathbb{E}_{a_{<j}\sim\pi^{(r)}}\left[\mathrm{KL}\!\left(\pi^{(r)}_{j}(\cdot\mid s,a_{<j})\,\|\,\pi^{(r)}_{\mathrm{ref},j}(\cdot\mid s,a_{<j})\right)\right].$$

If the left-hand side is at most $\eta$, then

$$\frac{1}{M_{r}(s)}\sum_{j=1}^{M_{r}(s)}D_{j}(s)\leq\frac{\eta}{M_{r}(s)}.$$

∎

### B.6 Proof of Corollary [3.5](https://arxiv.org/html/2605.15354#S3.Thmtheorem5)

###### Proof.

By definition,

$$L_{\mathrm{atom}}(n)=n+\binom{n}{2}=\frac{n^{2}+n}{2},$$

and

$$L_{\mathrm{motif}}(n)=n+\binom{n}{2}+n(n-1)=\frac{3n^{2}-n}{2}.$$

Therefore

$$\frac{L_{\mathrm{motif}}(n_{\mathrm{motif}}(x))}{L_{\mathrm{atom}}(n_{\mathrm{atom}}(x))}=\frac{3n_{\mathrm{motif}}(x)^{2}-n_{\mathrm{motif}}(x)}{n_{\mathrm{atom}}(x)^{2}+n_{\mathrm{atom}}(x)}.$$

Let

$$\chi(x)=\frac{n_{\mathrm{atom}}(x)}{n_{\mathrm{motif}}(x)}.$$

Since

$$3n_{\mathrm{motif}}(x)^{2}-n_{\mathrm{motif}}(x)\leq 3n_{\mathrm{motif}}(x)^{2}$$

and

$$n_{\mathrm{atom}}(x)^{2}+n_{\mathrm{atom}}(x)\geq n_{\mathrm{atom}}(x)^{2},$$

we obtain

$$\frac{L_{\mathrm{motif}}(n_{\mathrm{motif}}(x))}{L_{\mathrm{atom}}(n_{\mathrm{atom}}(x))}\leq\frac{3n_{\mathrm{motif}}(x)^{2}}{n_{\mathrm{atom}}(x)^{2}}=\frac{3}{\chi(x)^{2}}.$$

∎
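As a worked instance of the two decision counts, the sketch below evaluates the exact ratio and the bound $3/\chi(x)^{2}$. The function names are hypothetical, and the example sizes (33 atoms, 6 motifs) are illustrative integers chosen near the average polymer graph sizes reported in Section C.2.2, not values taken from a specific molecule.

```python
from math import comb

def l_atom(n):
    """Atom-level decisions: n node types plus one entry per node pair."""
    return n + comb(n, 2)

def l_motif(n):
    """Motif-level decisions: nodes, node pairs, and n(n-1) ordered attachment entries."""
    return n + comb(n, 2) + n * (n - 1)

def decision_ratio(n_atom, n_motif):
    """Exact ratio L_motif / L_atom and the upper bound 3 / chi^2."""
    ratio = l_motif(n_motif) / l_atom(n_atom)
    chi = n_atom / n_motif
    return ratio, 3.0 / chi ** 2

ratio, bound = decision_ratio(33, 6)   # 51 motif-level vs 561 atom-level decisions
```

For these sizes the motif-level representation needs roughly 9% of the atom-level decisions, and the exact ratio stays below the bound as the corollary states.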

## Appendix C Details on Data

### C.1 Datasets

For all random sampling, we set the random seed to 42.

#### C.1.1 Pretraining Datasets

For unconditional pretraining (PT), we curate two unlabeled datasets.

- Polymer: The polymer corpus contains 12,792 unique real-world polymers.
- Small molecule: We curate a 10,000-molecule unlabeled corpus from MoleculeNet \[Wu et al., [2018](https://arxiv.org/html/2605.15354#bib.bib13)\] by sampling valid and unique SMILES from datasets spanning physical chemistry, biophysics, and physiology.

No target labels or task identifiers are used during this stage, so PT remains fully unconditional\. This design provides a broad structural prior before adapting the model to heterogeneous downstream molecular tasks\.

We report RDKit-derived structural statistics in [Table 6](https://arxiv.org/html/2605.15354#A3.T6). We further visualize the real-atom and ring-count distributions in [Figure 2](https://arxiv.org/html/2605.15354#A3.F2).

Table 6: Structural statistics of the unconditional pretraining datasets. Values are reported as mean / maximum.

![Refer to caption](https://arxiv.org/html/2605.15354v1/x2.png)

Figure 2: Atom and ring count for pretraining datasets.

#### C.1.2 Post-training Datasets

For SFT, RL, and evaluation, we construct three target-conditioned task sets, with statistics summarized in [Table 7](https://arxiv.org/html/2605.15354#A3.T7). The polymer DFT set contains electronic-structure properties relevant to functional polymer design, including electron affinity (Eea), bulk band gap (Egb), and chain band gap (Egc) for training, with ionization energy (Ei) and dielectric constant (EPS) held out as unseen targets \[Xu et al., [2023](https://arxiv.org/html/2605.15354#bib.bib15)\]. The polymer gas set includes O2, N2, and CO2 permeability targets for membrane materials \[Thornton et al., [2012](https://arxiv.org/html/2605.15354#bib.bib14)\]. All gas-permeability values are transformed into log space following previous work \[Ma and Luo, [2020](https://arxiv.org/html/2605.15354#bib.bib33)\]. The drug set spans physical chemistry, target bioactivity, and physiological permeability through the numerical FreeSolv dataset and two class-balanced datasets, BACE and BBBP, from \[Wu et al., [2018](https://arxiv.org/html/2605.15354#bib.bib13)\].

Table 7: Task definitions and raw statistics for post-training datasets. Target ranges are reported in the original property units before any target transformation.

We visualize task-set-level target distributions in [Figure 3](https://arxiv.org/html/2605.15354#A3.F3). Regression tasks are shown as target-value histograms, while classification tasks are shown by class counts.

![Refer to caption](https://arxiv.org/html/2605.15354v1/x3.png)

Figure 3: Target distributions for conditional training and evaluation datasets.

For RL post-training, we use the same data sources as SFT. Within each task set, we randomly subsample each task’s training split to match the smallest training-set size among the tasks in that set. The validation and test splits are kept fixed between SFT and RL, ensuring that both stages are evaluated on the same held-out conditions.

### C.2 Tokenizer Preparation

We learn domain-specific tokenizers for the polymer and molecule domains using the same default configuration, V1000-R80, where V denotes the total motif vocabulary size and R denotes the ring vocabulary size.

#### C.2.1 Vocabulary Learning

We construct the motif tokenizer using the data-driven Node Pair Encoding (NPE) algorithm of DemoDiff \[Liu et al., [2025](https://arxiv.org/html/2605.15354#bib.bib6)\]. Each molecule is represented as a graph of atom-disjoint connected units that cover the original atom-level graph. Inter-unit edges store the bond type and the attachment positions on both endpoint units, enabling lossless reconstruction to the original atom-level molecular graph.

Vocabulary learning starts from singleton atom tokens, including atom types observed in the pretraining corpus and the polymerization symbol “\*”. For constrained NPE, we further enumerate frequent maximal ring units and add the top-$R$ rings to the initial vocabulary, treating them as indivisible units during later merging to avoid ambiguous ring attachments. NPE then repeatedly counts adjacent unit pairs in the tokenized training graphs, merges the most frequent pair into a new vocabulary unit, and retokenizes the graphs until the target vocabulary size is reached.
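The count-merge-retokenize loop described above can be sketched as follows. This is a simplified illustration, not the DemoDiff implementation: graphs are plain `(labels, edges)` pairs with string labels, bond types and attachment positions are dropped (so the sketch is not lossless), merges are applied greedily to non-overlapping edges, and all function names are hypothetical. Labels passed in `frozen` stand in for ring units that must stay indivisible.

```python
from collections import Counter

def count_pairs(graphs, frozen):
    """Count adjacent label pairs, skipping pairs that touch a frozen (ring) unit."""
    counts = Counter()
    for labels, edges in graphs:
        for u, v in edges:
            a, b = labels[u], labels[v]
            if a in frozen or b in frozen:
                continue
            counts[tuple(sorted((a, b)))] += 1
    return counts

def merge_pair(labels, edges, pair):
    """Contract non-overlapping edges whose endpoint labels match `pair`."""
    new_label = "(" + pair[0] + "+" + pair[1] + ")"
    labels = dict(labels)
    parent, used = {}, set()
    for u, v in edges:
        if u in used or v in used:
            continue
        if tuple(sorted((labels[u], labels[v]))) == pair:
            used.update((u, v))
            parent[v] = u                      # node v is absorbed into node u
    for v, u in parent.items():
        labels[u] = new_label
        del labels[v]
    new_edges = set()
    for u, v in edges:
        u2, v2 = parent.get(u, u), parent.get(v, v)
        if u2 != v2:                           # drop the contracted edge itself
            new_edges.add((min(u2, v2), max(u2, v2)))
    return labels, sorted(new_edges)

def learn_vocab(graphs, target_size, frozen=()):
    """Merge the most frequent adjacent pair until the vocabulary is large enough."""
    frozen = set(frozen)
    vocab = {lab for labels, _ in graphs for lab in labels.values()} | frozen
    while len(vocab) < target_size:
        counts = count_pairs(graphs, frozen)
        if not counts:
            break
        pair, _ = max(counts.items(), key=lambda kv: (kv[1], kv[0]))
        graphs = [merge_pair(l, e, pair) for l, e in graphs]
        vocab.add("(" + pair[0] + "+" + pair[1] + ")")
    return vocab, graphs
```

On a toy corpus of two C-C-O chains and one C-C-C chain, a single merge step selects the C-C pair and rewrites each chain as a two-node motif graph.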

#### C.2.2 Vocabulary Size Selection

We choose the tokenizer configuration based on two criteria: motif-occurrence coverage and graph-size compression.

##### Motif-occurrence Coverage.

Motif-occurrence coverage measures the fraction of token occurrences explained by the top-$k$ most frequent motifs after tokenization. As shown in [Figure 4](https://arxiv.org/html/2605.15354#A3.F4), V1000-R80 provides the strongest coverage efficiency across polymer and molecule pretraining datasets, especially under small and medium retained-token budgets. Larger vocabularies distribute occurrences over more rare motifs, reducing top-$k$ coverage at the same retained vocabulary size.

![Refer to caption](https://arxiv.org/html/2605.15354v1/x4.png)

Figure 4: Motif-occurrence coverage under different tokenizer configurations. Coverage is the fraction of token occurrences captured by the top-$k$ most frequent motifs after tokenization. V denotes the total motif vocabulary size, and R denotes the ring vocabulary size. The dashed line marks the 95% coverage reference.
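The coverage metric defined above amounts to a cumulative share of token counts. A minimal sketch, assuming the tokenized corpus has already been reduced to a motif-to-count mapping (the function name and the toy counts are hypothetical):

```python
from collections import Counter

def top_k_coverage(token_counts, k):
    """Fraction of all token occurrences captured by the k most frequent motifs."""
    total = sum(token_counts.values())
    if total == 0:
        return 0.0
    top = sorted(token_counts.values(), reverse=True)[:k]
    return sum(top) / total

# Toy occurrence counts, for illustration only.
counts = Counter({"C": 50, "c1ccccc1": 30, "O": 15, "N": 5})
coverage_at_2 = top_k_coverage(counts, 2)   # share covered by the two most frequent motifs
```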
##### Graph\-size Compression\.

We further evaluate graph-size compression in [Table 8](https://arxiv.org/html/2605.15354#A3.T8). V1000-R80 reduces polymer graphs from 32.71 atom nodes, including dummy wildcard attachment atoms, to 5.85 motif nodes on average, and molecule graphs from 20.87 atom nodes to 9.22 motif nodes. This corresponds to an 82.1% node reduction for polymers and a 55.8% reduction for molecules. Although larger vocabularies or ring vocabularies can further reduce graph size, V1000-R80 offers a better coverage-compression trade-off. The weaker compression on small molecules is consistent with the more modest gains observed on the drug-related molecular task set in [Table 3](https://arxiv.org/html/2605.15354#S4.T3).

Table 8: Average graph-size compression under representative tokenizer configurations. Compression is the ratio of average atom nodes to average motif nodes, and reduction is the corresponding percentage decrease in node count.

We therefore use V1000-R80 as the default tokenizer configuration. It provides the best occurrence-coverage efficiency while still achieving substantial graph compression, offering a practical balance between compactness, motif expressiveness, and stable training.
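The reported reductions follow directly from the average node counts under the definitions in the Table 8 caption. A short check, with a hypothetical helper name:

```python
def compression_stats(avg_atom_nodes, avg_motif_nodes):
    """Return (compression ratio, fractional node reduction)."""
    ratio = avg_atom_nodes / avg_motif_nodes
    reduction = 1.0 - avg_motif_nodes / avg_atom_nodes
    return ratio, reduction

polymer_ratio, polymer_reduction = compression_stats(32.71, 5.85)
molecule_ratio, molecule_reduction = compression_stats(20.87, 9.22)
```

Rounded to one decimal place, the reductions are 82.1% for polymers and 55.8% for molecules, corresponding to compression ratios of about 5.6 and 2.3.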

## Appendix D Details on Experiments

##### Hardware

Each training or evaluation run can be executed on a single NVIDIA A100 GPU; no distributed training is required.

### D.1 Training Details

All CoMole variants use the same approximately 0.15B-parameter backbone, with transformer depth 12, hidden size 1024, 16 attention heads, and MLP ratio 4. We use $T=500$ diffusion steps. We train one CoMole instance for each task set described in [Section C.1.2](https://arxiv.org/html/2605.15354#A3.SS1.SSS2). Each instance follows the same three-stage pipeline: PT, SFT, and PPO-based RL.

We use a consistent set of training hyperparameters across task sets:

- PT is run for 400 epochs with batch size 64, learning rate $2\times 10^{-5}$, gradient clipping 0.1, and weight decay $10^{-12}$.
- SFT is run for up to 3,000 epochs with batch size 64 and learning rate $2\times 10^{-5}$. Checkpoints are selected by validation controllability averaged over tasks.
- RL is run for up to 400 epochs with batch size 32, learning rate $1\times 10^{-5}$, PPO clip range 0.2, value loss coefficient 0.5, entropy coefficient 0.001, KL coefficient 0.01, one rollout collection pass, and two PPO update passes. For computational efficiency, we apply partial-suffix fine-tuning to the final 30 reverse denoising steps, following \[Ren et al., [2024](https://arxiv.org/html/2605.15354#bib.bib8)\]. For regression tasks, the property reward uses a Gaussian target-alignment term with $\sigma=0.5$ after target normalization. For classification tasks, the property reward is the evaluator-predicted probability assigned to the target class. The terminal reward combines the property term with the validity term using $w_{\mathrm{val}}=0.1$.
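The reward description in the RL item can be illustrated with the sketch below. Two modeling choices here are assumptions rather than details stated in this appendix: the Gaussian term is taken to be $\exp(-(\hat{y}-y)^{2}/(2\sigma^{2}))$ on normalized targets, and the property and validity terms are combined additively. Only $\sigma=0.5$ and $w_{\mathrm{val}}=0.1$ come from the text, and the function names are hypothetical.

```python
import math

def regression_reward(pred_norm, target_norm, sigma=0.5):
    """Gaussian target-alignment term (assumed kernel form) on normalized values."""
    return math.exp(-((pred_norm - target_norm) ** 2) / (2.0 * sigma ** 2))

def classification_reward(prob_target_class):
    """Evaluator-predicted probability of the target class."""
    return float(prob_target_class)

def terminal_reward(property_reward, is_valid, w_val=0.1):
    """Property term plus a weighted validity indicator (additive form is an assumption)."""
    return property_reward + w_val * (1.0 if is_valid else 0.0)
```

Under these assumptions, a valid molecule whose predicted property matches the target exactly receives a reward of 1.1, and a prediction one $\sigma$ away from the target receives a property term of about 0.61.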

For baselines, we follow the official hyperparameter settings whenever available. When official settings are unavailable or incomplete, we use Optuna \[Akiba et al., [2019](https://arxiv.org/html/2605.15354#bib.bib34)\] to tune hyperparameters on the validation set and report test performance using the selected configuration. All methods are evaluated on the same data splits, target conditions, and property evaluators.

### D.2 Computational Cost and Execution Time

We report the wall-clock runtime and peak memory usage of the main training stages in [Table 9](https://arxiv.org/html/2605.15354#A4.T9). All runs are conducted on a single NVIDIA A100 GPU. Peak CPU memory refers to host-side process memory, while peak GPU memory refers to allocated device memory. The polymer pretraining run is shared by the two polymer benchmarks, namely Polymer gas and Polymer DFT, whereas the molecule pretraining run is used for the Drug benchmark.

The reported runtime and memory usage characterize the practical computational cost of training CoMole under our experimental setting. These measurements may vary with hardware configuration, software environment, I/O overhead, and cluster scheduling conditions.

Table 9: Runtime and peak memory usage of the main training stages on a single NVIDIA A100 GPU.

| Stage | Data Type | Peak CPU Memory | Peak GPU Memory | Execution time |
| --- | --- | --- | --- | --- |
| Pretraining | Polymer | ∼5.8 GB | ∼12.9 GB | 3h 17m 20s |
| Pretraining | Molecule | ∼5.8 GB | ∼10.3 GB | 1h 45m 33s |
| SFT | Polymer gas | ∼8.0 GB | ∼19.4 GB | 6h 55m 45s |
| SFT | Polymer DFT | ∼6.7 GB | ∼10.0 GB | 5h 35m 34s |
| SFT | Drug | ∼6.1 GB | ∼10.3 GB | 4h 25m 1s |
| RL post-training | Polymer gas | ∼6.0 GB | ∼7.4 GB | 2d 3h 49m |
| RL post-training | Polymer DFT | ∼6.0 GB | ∼7.5 GB | 1d 18h 41m |
| RL post-training | Drug | ∼6.3 GB | ∼7.4 GB | 2d 7h 6m |

### D.3 Training Dynamics

We visualize validation dynamics during SFT and RL post-training in [Figures 5](https://arxiv.org/html/2605.15354#A4.F5) and [6](https://arxiv.org/html/2605.15354#A4.F6). For the drug benchmark, Avg. MAE is computed as the macro-average over FreeSolv, BACE, and BBBP, giving each task equal weight. For categorical tasks such as BACE and BBBP, MAE is defined as the average absolute difference between the predicted positive-class probability and the binary target label.
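The Avg. MAE definition for the drug benchmark can be written out as below. This is a minimal sketch with hypothetical function names and toy numbers; it only illustrates how regression and classification errors are placed on a common scale and then macro-averaged.

```python
def regression_mae(preds, targets):
    """Mean absolute error between predicted and target property values."""
    return sum(abs(p - t) for p, t in zip(preds, targets)) / len(preds)

def classification_mae(pos_probs, labels):
    """Mean absolute gap between positive-class probability and the 0/1 label."""
    return sum(abs(p - y) for p, y in zip(pos_probs, labels)) / len(labels)

def macro_avg_mae(task_maes):
    """Unweighted mean over tasks, so each task counts equally."""
    return sum(task_maes.values()) / len(task_maes)

# Toy values for illustration only.
task_maes = {
    "FreeSolv": regression_mae([1.0, 2.0], [1.5, 2.5]),
    "BACE": classification_mae([0.9, 0.2], [1, 0]),
    "BBBP": classification_mae([0.6, 0.4], [1, 0]),
}
avg_mae = macro_avg_mae(task_maes)
```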

During SFT, all three benchmarks exhibit stable training behavior\. Validity increases rapidly and remains high throughout training, while fragment similarity stays close to the reference distribution\. FCD decreases sharply in the early stage for the gas\-permeability benchmark and remains relatively stable for the DFT and drug benchmarks\. Avg\. MAE decreases substantially for the DFT and drug benchmarks and improves overall for the gas\-permeability benchmark, despite moderate fluctuations\. These results indicate that SFT improves controllability while preserving the structural prior learned during pretraining\.

RL post\-training shows benchmark\-dependent behavior\. On the two polymer benchmarks, RL further improves Avg\. MAE while largely maintaining distribution fidelity\. The gas\-permeability benchmark preserves high validity and stable fragment similarity, whereas the DFT benchmark improves controllability with moderate fluctuations in validity and distribution metrics\.

In contrast, the drug benchmark exhibits a stronger distribution shift during RL. Although Avg. MAE decreases and validity remains high, FCD increases substantially and fragment similarity drops. This trend is consistent with the heterogeneous nature of the drug benchmark, where FreeSolv, BACE, and BBBP span physical chemistry, target-specific bioactivity, and physiological permeability, respectively \[Wu et al., [2018](https://arxiv.org/html/2605.15354#bib.bib13)\]. Compared with a benchmark defined by a single drug-property family, this setting introduces more diverse target-alignment pressures, so RL improves controllability while shifting generation away from the aggregate reference distribution.

![Refer to caption](https://arxiv.org/html/2605.15354v1/x5.png)

Figure 5: SFT validation dynamics across the polymer DFT, polymer gas-permeability, and drug benchmarks. Validity denotes raw validity without rule-based checking. SFT steadily improves controllability while maintaining high distribution fidelity.

![Refer to caption](https://arxiv.org/html/2605.15354v1/x6.png)

Figure 6: RL validation dynamics across the polymer DFT, polymer gas-permeability, and drug benchmarks. Validity denotes raw validity without rule-based checking. RL further improves target controllability. The polymer benchmarks retain relatively stable distribution metrics, whereas the drug benchmark exhibits a stronger shift toward target-aligned regions.

### D.4 More Results

In the following results, we report validity measured without rule-based checking, denoted as Validity$_{\mathrm{w/o\ check}}$.

#### D.4.1 Joint Training on Polymer Properties

We report gas permeability and DFT properties as separate polymer task sets in the main experiments because they correspond to distinct property families and empirical distributions. To examine whether they can be merged into a single heterogeneous polymer setting, we compare two training protocols:

- *Separated*: the main experimental setting, where gas-permeability and DFT targets are trained and evaluated using two benchmark-specific models.
- *Joint*: a single model trained jointly on all six polymer targets, including O2Perm, N2Perm, CO2Perm, Eea, Egb, and Egc.

For the separated setting, target MAEs are taken from the corresponding gas and DFT benchmark runs. Validity$_{\mathrm{w/o\ check}}$ and distribution metrics are reported as the arithmetic mean of the two benchmark-level results, while Avg. MAE is the simple average over all six target-level MAEs.

As shown in [Table 10](https://arxiv.org/html/2605.15354#A4.T10), joint training does not cause a general collapse in validity or distribution learning. However, merging the two polymer property families weakens target control. At the SFT stage, the six-target Avg. MAE increases from 0.5400 in the separated setting to 0.5786 under joint training. After RL, the degradation becomes more pronounced: Avg. MAE increases from 0.3713 to 0.5916. This drop is mainly driven by gas-permeability targets, whose average MAE increases from 0.4315 to 0.8510, whereas DFT targets are less affected, changing from 0.3111 to 0.3322. This suggests that gas-permeability targets introduce stronger distributional heterogeneity into the six-target setting.

These results indicate that merging heterogeneous polymer property families can introduce negative transfer in target controllability, even when validity and distribution\-level metrics remain competitive\. We therefore keep gas\-permeability and DFT properties as separate polymer task sets in the main evaluation\.
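The six-target averages quoted after RL can be recovered from the per-family averages, since each family contributes three targets. A short arithmetic check with a hypothetical helper name:

```python
def six_target_avg(gas_avg_mae, dft_avg_mae, n_gas=3, n_dft=3):
    """Average over all targets, given the per-family average MAEs."""
    return (n_gas * gas_avg_mae + n_dft * dft_avg_mae) / (n_gas + n_dft)

separated_after_rl = six_target_avg(0.4315, 0.3111)   # separated setting
joint_after_rl = six_target_avg(0.8510, 0.3322)       # joint setting
```

Rounded to four decimals these give 0.3713 and 0.5916, matching the reported six-target Avg. MAE values.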

Table 10: Joint training analysis on polymer tasks. *Separated* denotes the main setting where gas permeability and DFT properties are trained as two benchmark-specific models. *Joint* denotes a single model trained jointly on all six polymer targets. For *Separated*, validity and distribution metrics are arithmetic means of the two benchmark-level results, while Avg. MAE is the simple average over the six target-level MAEs. Best results are in bold.

#### D.4.2 Other RL Post-training Objective

Our main experiments instantiate RL post-training with PPO. To test whether our pipeline is specific to PPO, we additionally evaluate a GDPO \[Liu et al., [2024b](https://arxiv.org/html/2605.15354#bib.bib7)\] policy optimization objective on the gas-permeability polymer benchmark. We denote this variant as CoMole$_{\text{w/ GDPO}}$, which keeps the motif tokenizer, SFT checkpoint, terminal reward, and evaluation protocol unchanged.

As shown in [Table 11](https://arxiv.org/html/2605.15354#A4.T11), the GDPO update is compatible with our RL pipeline and improves target control over the SFT checkpoint. Compared with CoMole$_{\text{w/o RL}}$, CoMole$_{\text{w/ GDPO}}$ reduces Avg. MAE from 0.7013 to 0.5879 while maintaining high raw validity. It also substantially outperforms the original atom-level GDPO baseline in validity and distribution similarity, highlighting the benefit of motif-aware actions.

However, PPO remains stronger in this setting\. These results suggest that our pipeline is not tied to a specific RL algorithm, while PPO provides an effective instantiation for the main experiments\.

Table 11: Alternative RL algorithm on the gas-permeability benchmark. CoMole$_{\text{w/ GDPO}}$ replaces PPO in CoMole with a GDPO-style policy update, while keeping all other settings unchanged. Best results are in bold.

#### D.4.3 Further Analysis of CoMole$_{\text{w/o RL}}$

In this section, we analyze whether the SFT reference policy $\pi_{\mathrm{ref}}$ mentioned in [Section 3.2.1](https://arxiv.org/html/2605.15354#S3.SS2.SSS1) already assigns trajectory-level probability to low-error generations before RL. Using the CoMole$_{\text{w/o RL}}$ model trained on the polymer DFT benchmark, for each test condition $c$, we draw $K$ independent samples and compute the absolute target error under that condition. We report two metrics:

- $\mathrm{mean\text{-}MAE}@K$ averages the errors over all sampled molecules and conditions, reflecting standard sampling quality.
- $\mathrm{best\text{-}MAE}@K$ first selects, for each condition, the lowest-error sample among the $K$ draws, and then averages this best-sample error over conditions.

As shown in [Table 12](https://arxiv.org/html/2605.15354#A4.T12), $\mathrm{best\text{-}MAE}@K$ decreases from 0.3676 at $K=1$ to 0.0261 at $K=32$, while $\mathrm{mean\text{-}MAE}@K$ remains relatively stable. This suggests that the SFT reference policy already covers low-error candidates, but standard sampling does not select them reliably. RL can therefore improve controllability by increasing the probability of these target-aligned generations.
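Both repeated-sampling metrics can be computed from a matrix of absolute errors with one row per test condition and one column per draw. A minimal sketch, assuming this array layout and a hypothetical function name:

```python
import numpy as np

def mae_at_k(abs_errors):
    """Return (mean-MAE@K, best-MAE@K) for errors of shape (num_conditions, K)."""
    errs = np.asarray(abs_errors, dtype=float)
    mean_mae = float(errs.mean())                 # average over all samples and conditions
    best_mae = float(errs.min(axis=1).mean())     # best draw per condition, then average
    return mean_mae, best_mae

# Toy errors for two conditions with K = 2 draws each.
mean_mae, best_mae = mae_at_k([[0.4, 0.1], [0.3, 0.5]])
```

For $K=1$ the two metrics coincide; as $K$ grows, the best-sample metric can only stay the same or decrease, while the mean metric estimates the same expected error at any $K$.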

Table 12: Repeated sampling from CoMole$_{\text{w/o RL}}$ on the polymer DFT benchmark, with exactly $K$ draws per test condition.

#### D.4.4 Oracle Robustness

To validate whether our conclusions depend on the property oracle used for reward and evaluation, we perform an oracle robustness analysis\. During RL post\-training, the property component of the terminal reward is computed by a learned oracle, so a model could in principle overfit to the biases of a specific evaluator\.

We therefore freeze the generated candidates from the main experiments and re\-score them using alternative predictors trained on the same training splits, including Random Forest \[Breiman, [2001](https://arxiv.org/html/2605.15354#bib.bib35)\], Support Vector Machine, and GRIN \[Zhu et al\., [2026](https://arxiv.org/html/2605.15354#bib.bib16)\]\. For each oracle, we compute the corresponding controllability metric for every target, rank models by target\-level performance, and report the average rank across the nine evaluated properties in [Table 13](https://arxiv.org/html/2605.15354#A4.T13)\.
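The rank aggregation described above can be sketched as follows. This is a minimal illustration, assuming a lower-is-better score for every target (a higher-is-better metric would need to be negated first) and ties broken by input order; the function name `average_ranks`, the model and target names, and the scores are all hypothetical.

```python
def average_ranks(scores):
    """Average each model's per-target rank (1 = best).

    scores: dict mapping target -> dict mapping model -> error,
    where a lower error is assumed to be better.
    """
    ranks = {}
    for per_model in scores.values():
        # Sort models by error; sorted() is stable, so ties keep input order.
        ordered = sorted(per_model, key=per_model.get)
        for position, model in enumerate(ordered, start=1):
            ranks.setdefault(model, []).append(position)
    return {model: sum(r) / len(r) for model, r in ranks.items()}


# Hypothetical re-scored errors under one alternative oracle.
toy_scores = {
    "target_a": {"model_x": 0.2, "model_y": 0.5, "model_z": 0.9},
    "target_b": {"model_x": 0.1, "model_y": 0.7, "model_z": 0.4},
}
avg_rank = average_ranks(toy_scores)
```

Repeating this computation once per oracle gives one average-rank column per evaluator, which is the kind of comparison the robustness analysis relies on.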

Although the ordering of some baselines varies across oracles, CoMole consistently ranks first under all three evaluators\. This suggests that our main conclusion is robust to the choice of learned oracle and is not driven by a specific evaluator\.

In the main experiments, we use the best\-performing validation oracle for each domain: Random Forest for the drug\-related molecular task set and GRIN for the polymer task sets\. Although GRIN can be trained as a general graph predictor, its main advantage comes from repetition\-aware augmentation for polymer repeat\-unit representations, which provides limited benefit for small molecules\.

Table 13: Oracle robustness for generation evaluation\. Models are sorted by their average target\-level rank across the nine evaluated properties, and only ranks 1–9 are shown here\. The ordering of baselines varies across evaluators, but CoMole consistently ranks first\.
#### D\.4\.5 Novelty and Uniqueness

We additionally report novelty and uniqueness as complementary generation\-quality metrics\. These metrics are computed over generated samples aggregated across the evaluated target conditions\. Novelty measures the fraction of valid generated molecules absent from the training set, while uniqueness measures the fraction of non\-duplicated valid generations after canonicalization\.
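A minimal Python sketch of these two metrics is given below. It assumes that both fractions are taken over all valid generations and that a canonicalization function returns `None` for invalid inputs. The helper `toy_canonicalize` is a hypothetical stand-in for a real chemistry canonicalizer and performs no chemical validation.

```python
def novelty_and_uniqueness(generated, training, canonicalize):
    """Return (novelty, uniqueness) over valid generated molecules.

    canonicalize: callable mapping a molecule string to a canonical string,
    or to None when the molecule is invalid (assumed interface).
    """
    canonical = [canonicalize(s) for s in generated]
    valid = [c for c in canonical if c is not None]
    if not valid:
        return 0.0, 0.0
    train_set = {canonicalize(s) for s in training}
    train_set.discard(None)
    # Novelty: share of valid generations not present in the training set.
    novelty = sum(c not in train_set for c in valid) / len(valid)
    # Uniqueness: share of distinct canonical forms among valid generations.
    uniqueness = len(set(valid)) / len(valid)
    return novelty, uniqueness


def toy_canonicalize(s):
    # Hypothetical stand-in: strips whitespace and treats empty strings as invalid.
    s = s.strip()
    return s or None


generated = ["CCO", " CCO", "CCN", "", "CCC"]  # toy strings, not model outputs
training = ["CCC", "CCN"]
novelty, uniqueness = novelty_and_uniqueness(generated, training, toy_canonicalize)
```

In this toy case four of the five strings are valid; two of them are absent from the training set and three are distinct, giving a novelty of 0.5 and a uniqueness of 0.75.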

For conditional generation, however, these metrics are mainly sanity checks against memorization and duplication\. They do not measure whether generated molecules satisfy the desired target properties or lie in a practically useful design region\. Prior work has shown that simple perturbations such as AddCarbon can achieve near\-perfect scores without producing useful candidates \[Renz et al\., [2019](https://arxiv.org/html/2605.15354#bib.bib37), Tripp and Hernández\-Lobato, [2023](https://arxiv.org/html/2605.15354#bib.bib38)\]\. Therefore, we treat novelty and uniqueness as complementary rather than primary indicators of inverse\-design quality in our main experiments\.

As shown in [Table 14](https://arxiv.org/html/2605.15354#A4.T14), $\text{CoMole}_{\text{w/o RL}}$ and CoMole retain novelty and uniqueness, suggesting that CoMole improves target controllability without collapsing to memorized or highly duplicated generations\.

Table 14: Complementary novelty and uniqueness metrics across evaluated conditions\. Novelty and uniqueness are computed over valid generated molecules after canonicalization\.

### D\.5 Visualization

Given the quantitative results in [Tables 1](https://arxiv.org/html/2605.15354#S4.T1) and [2](https://arxiv.org/html/2605.15354#S4.T2), we further visualize whether low\-sample target\-conditioned generation produces chemically plausible structures\. For selected O2Perm, CO2Perm, Eea, and Egb test conditions, we sample 10 candidates from CoMole and show the rank\-1 generated structure\. The rank is computed by averaging the candidate’s rank in target\-value error and its rank in structural similarity to the corresponding held\-out reference structure\. The reference structure is used only for qualitative comparison and is not treated as the unique solution to the inverse\-design problem\.
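The rank-averaging selection can be sketched in Python as follows. This is an illustrative sketch only: the tuple layout, the function name `select_rank1`, the tie-breaking by input order, and the toy values are assumptions, and the similarity measure itself is left abstract.

```python
def select_rank1(candidates):
    """Pick the candidate with the best average of two ranks.

    candidates: list of (name, target_error, similarity) tuples with unique
    names; lower error is better and higher similarity is better.
    """
    by_error = sorted(candidates, key=lambda c: c[1])
    by_similarity = sorted(candidates, key=lambda c: -c[2])
    error_rank = {c[0]: i for i, c in enumerate(by_error, start=1)}
    similarity_rank = {c[0]: i for i, c in enumerate(by_similarity, start=1)}
    # min() returns the first candidate with the lowest averaged rank.
    return min(
        candidates,
        key=lambda c: (error_rank[c[0]] + similarity_rank[c[0]]) / 2,
    )


# Hypothetical candidates: (name, absolute target error, similarity to reference).
toy_candidates = [
    ("a", 0.30, 0.90),
    ("b", 0.05, 0.60),
    ("c", 0.10, 0.95),
    ("d", 0.50, 0.20),
]
best = select_rank1(toy_candidates)
```

Here candidate `c` is selected: it is second in target error and first in similarity, so its averaged rank of 1.5 beats candidate `b`, which has the lowest error but only the third-highest similarity.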

As shown in [Figures 7](https://arxiv.org/html/2605.15354#A4.F7) and [8](https://arxiv.org/html/2605.15354#A4.F8), the selected generated structures often retain chemically meaningful motifs or scaffold patterns related to the held\-out references, such as thiophene\-containing units for Eea and Egb, aliphatic ester repeat units for high\-Egb, aromatic imide motifs for CO2Perm, and aromatic sulfone or bulky aromatic ester scaffolds for O2Perm\. These examples qualitatively complement the quantitative results by showing that target\-conditioned samples can remain chemically plausible under limited sampling\.

![Refer to caption](https://arxiv.org/html/2605.15354v1/x7.png)

Figure 7: Rank\-1 generated structures selected from 10 generated samples separately for Eea and Egb conditions\.

![Refer to caption](https://arxiv.org/html/2605.15354v1/x8.png)

Figure 8: Rank\-1 generated structures selected from 10 generated samples separately for O2Perm and CO2Perm conditions\.

## Appendix E Limitations and Future Directions

##### Limitations

First, our evaluation covers representative but not exhaustive chemical design settings\. For polymers, we focus on gas\-permeability and DFT\-derived electronic properties, while other practically important targets, such as thermal properties \($T_g$ and $T_c$\) and mechanical properties, remain unexplored\. For small molecules, our benchmark includes regression and binary classification tasks, but does not yet cover broader condition types such as multi\-class labels\. Expanding the task coverage would provide a more comprehensive assessment of heterogeneous controllable generation\.

Second, we adopt a largely shared training configuration across benchmarks\. For example, CoMole uses the same configuration for the polymer and molecule tokenizers, as described in [Section C\.2](https://arxiv.org/html/2605.15354#A3.SS2)\. Although the V1000\-R80 tokenizer provides strong compression for polymers, yielding an $82.1\%$ node reduction, the compression is less pronounced for small molecules\. In particular, the current tokenizer reduces the RL decision space and credit\-assignment burden more substantially for polymers than for small molecules\.

##### Future Directions

Future work could develop stronger and more adaptive tokenizers to better balance compression, expressiveness, and controllability\. Promising directions include larger molecular vocabularies and unified tokenizers for polymers and small molecules that are less tied to a specific training corpus\.

Another important direction is simultaneous multi\-property control, where generated molecules or polymers must satisfy multiple numerical or categorical conditions at once\. This setting better reflects practical scientific design, where candidates are typically selected under coupled requirements involving property targets, chemical validity, synthesizability, and distributional plausibility\.

More broadly, the long\-term goal of controllable molecular generation is to advance scientific discovery by bridging the gap between hypothetical structure generation and practical inverse design\. Such models can help researchers prioritize more promising molecular candidates under heterogeneous property constraints before committing to expensive simulations, experiments, or screening cycles\. By combining transferable structural priors with task\-conditioned adaptation and reward\-guided post\-training, CoMole provides a step toward foundation\-model\-based design systems for molecules and polymers\.
