MOLAR: Learning Multimodal Molecular Representations from Noisy Labels

arXiv cs.LG Papers

Summary

MOLAR proposes a noise-aware framework for learning multimodal molecular representations from noisy labels by separating clean-property inference from observed label noise, outperforming baselines on molecular benchmarks.

arXiv:2606.18390v1 Announce Type: new Abstract: Motivation: Noisy labels are a common challenge in molecular property prediction because molecular annotations are often obtained from assays, curated databases, or weak annotation pipelines rather than directly observed clean biological states. Treating recorded labels as reliable supervision can cause models to memorize corrupted observations and learn misleading molecular evidence. In multimodal molecular representation learning, this issue can be amplified by graph-text fusion or alignment, which may propagate label-induced errors across modalities. Results: We propose MOLAR, a noise-aware framework for learning multimodal molecular representations from noisy labels. MOLAR separates latent clean-property inference from recorded-label observation: graph and text views contribute residual evidence to a clean-property distribution, and a categorical label-observation channel maps this distribution to recorded labels for training. This formulation derives posterior label reliability and modality-specific molecular evidence from the model. Experiments on naturally noisy molecular benchmarks and controlled label-flipping benchmarks show that MOLAR consistently outperforms representative baselines. Visualization analyses further show that MOLAR provides interpretable reliability and modality-evidence diagnostics.

# MOLAR: Learning Multimodal Molecular Representations from Noisy Labels
Source: [https://arxiv.org/html/2606.18390](https://arxiv.org/html/2606.18390)
Kunyu Zhang, Nan Yin, Yu Li, Eran Segal

Affiliations: Department of Machine Learning, Mohamed bin Zayed University of Artificial Intelligence, AI Diyafah St, 7909, Abu Dhabi, United Arab Emirates; International College, Zhengzhou University, Daxue North Road, 450000, Henan, China; The Education University of Hong Kong, Hong Kong, China; Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong, China; Department of Molecular Cell Biology, Weizmann Institute of Science, Rehovot, Israel

∗ Equal contributions.

###### Abstract

Motivation: Noisy labels are a common challenge in molecular property prediction because molecular annotations are often obtained from assays, curated databases, or weak annotation pipelines rather than directly observed clean biological states. Treating recorded labels as reliable supervision can cause models to memorize corrupted observations and learn misleading molecular evidence. In multimodal molecular representation learning, this issue can be amplified by graph–text fusion or alignment, which may propagate label-induced errors across modalities. Results: We propose MOLAR, a noise-aware framework for learning multimodal molecular representations from noisy labels. MOLAR separates latent clean-property inference from recorded-label observation: graph and text views contribute residual evidence to a clean-property distribution, and a categorical label-observation channel maps this distribution to recorded labels for training. This formulation derives posterior label reliability and modality-specific molecular evidence from the model. Experiments on naturally noisy molecular benchmarks and controlled label-flipping benchmarks show that MOLAR consistently outperforms representative baselines. Visualization analyses further show that MOLAR provides interpretable reliability and modality-evidence diagnostics.

###### keywords:

Multimodal molecular representation learning, molecular property prediction, learning from noisy labels

## 1 Introduction

Multimodal molecular representation learning aims to learn predictive molecular embeddings from multiple views of the same compound [edwards2022translation; liu2023multi; liu2023molca]. Typically, each molecule is represented by a structured molecular graph that characterizes atoms, chemical bonds, and topological connectivity [duvenaud2015convolutional; ma2022cross; wang2024chain], together with associated textual descriptions, SMILES-based language representations [smiles; ross2022large; pei2023biot5], or descriptor summaries that capture complementary semantic, physicochemical, and pharmacological information. Because these views encode different aspects of molecular behavior, integrating them enables more comprehensive molecular representations and has become an effective strategy for bioactivity prediction, toxicity assessment, physicochemical property modeling, functional annotation, and downstream experimental prioritization [pei2024biot5+; zheng2025large; boldini2024machine; wang2026sgac].

![Refer to caption](https://arxiv.org/html/2606.18390v1/x1.png)

Figure 1: Overview of the three challenges studied in this work. (a) The prediction target should be separated from potentially noisy recorded labels. (b) Recorded labels should still provide useful supervision for learning clean molecular properties. (c) Graph–text disagreement may indicate label noise, weak modality evidence, or uncertainty, rather than a simple alignment error.

Molecular representation learning has evolved from manually designed chemical encodings to deep graph-based and multimodal paradigms [mcgibbon2024intuition; wang2025dusego]. Early approaches used molecular fingerprints, physicochemical descriptors, and SMILES strings as compact representations for similarity search and property prediction [rogers2010extended; deng2023systematic; wang2022advanced; wang2026riemannian]. With the development of deep learning, SMILES strings and molecular texts have been modeled by recurrent neural networks, convolutional networks, Transformers, and chemical language models, enabling the extraction of syntactic patterns, functional groups, and chemical semantics from sequential molecular descriptions [ross2022large; edwards2022translation; pei2023biot5]. However, sequence-based representations provide only an indirect description of molecular topology [yoshikai2024difficulty; sadeghi2024can; wang2026usbd]. Graph neural networks address this limitation by representing molecules as atom–bond graphs and propagating information over chemical neighborhoods, thereby capturing local substructures, long-range connectivity, and topology-dependent molecular patterns [ma2022cross; wang2024chain; zhao2024molecular]. More recently, multimodal molecular learning has sought to integrate graph- and text-derived information to obtain richer molecular representations [wu2023molecular; rollins2024molprop; zhang2024mvmrl].

Existing methods commonly combine modalities through cross-modal attention, contrastive alignment, shared embedding spaces, or molecule–language pretraining objectives [edwards2022translation; liu2023multi; liu2023molca]. These strategies improve representation capacity by exploiting the complementarity between structural and semantic molecular views [liu2023multi; liu2023molca; fang2024mol]. Despite this progress, molecular property labels are not always clean supervision. Recorded labels may be affected by measurement variability, assay interference, thresholding decisions, inconsistent experimental protocols, conflicting annotations, database curation, or weak automatic labeling pipelines [buterez2023mf; boldini2024machine; deng2023systematic]. When such labels are directly treated as ground truth, predictive models may memorize corrupted observations and learn misleading molecular evidence, leading to degraded generalization on more reliable evaluation data [wei2023fine; nguyen2024noisy; lin2024learning]. Although noisy-label learning has been widely studied, existing methods mainly aim to reduce label memorization in unimodal classification through robust losses, co-training, sample reweighting, semi-supervised label refinement, or meta reweighting [wei2023fine; lin2024learning]. These approaches typically regard corrupted labels as sample-level training noise, but they do not explicitly distinguish latent molecular properties from recorded noisy labels, nor do they address how complementary molecular views should be used when supervision is unreliable.

This gap becomes especially important in multimodal molecular representation learning. Most multimodal objectives emphasize stronger graph–text fusion or alignment, implicitly assuming that the recorded label provides trustworthy supervision [edwards2022translation; liu2023multi; liu2023molca; wang2026nested]. Under noisy supervision, this assumption can be problematic: a corrupted label may be memorized by the predictor and further propagated across graph and text representations through the learned alignment. Moreover, graph–text disagreement may reflect label corruption, insufficient evidence in one modality, or genuine uncertainty about the molecular property, rather than a simple alignment error.

In this paper, we study noisy-label multimodal molecular representation learning and aim to develop a principled framework. As shown in Figure [1](https://arxiv.org/html/2606.18390#S1.F1), this setting raises three key challenges. First, what should the model predict when recorded labels may be unreliable? Molecular datasets usually provide only recorded labels, whereas the underlying molecular property of interest is not directly observed [boldini2024machine; deng2023systematic; wang2024degree]. Directly treating recorded labels as target variables may cause the model to absorb experimental variation, curation errors, or annotation noise as molecular evidence [wei2023fine; nguyen2024noisy; lin2024learning]. Second, how can noisy recorded labels still provide useful supervision? Although recorded labels may be corrupted, they remain the main source of supervision. A noise-aware formulation should therefore connect the clean-property posterior to the recorded-label distribution, rather than either fitting recorded labels directly or discarding them [liu2023identifiability; liao2025instance; nguyen2024noisy; wang2026brain]. Third, how should graph–text disagreement be interpreted under noisy supervision? In multimodal molecular learning, disagreement between graph and text views may reflect label corruption, insufficient evidence in one modality, or genuine uncertainty about the molecular property [edwards2022translation; liu2023multi]. Indiscriminately enforcing graph–text agreement may suppress useful modality-specific evidence and propagate label-induced errors [liu2023multi; liu2023molca; wei2023fine].

To address these challenges, we propose MOLAR, a noise-aware multimodal framework for learning molecular representations from noisy labels. Rather than treating recorded labels as clean targets, MOLAR explicitly separates clean molecular property inference from recorded-label observation. Specifically, graph and text views are first encoded into modality-specific representations and then formulated as residual natural-parameter evidence for a latent categorical clean-property distribution. To use recorded labels without directly fitting them as clean supervision, MOLAR introduces a categorical label-observation channel that maps the clean-property posterior to the recorded-label distribution. This probabilistic formulation links latent molecular properties to noisy supervision and naturally derives posterior label reliability from the model. To handle graph–text disagreement under unreliable supervision, MOLAR regularizes high-confidence contradictory evidence between modalities while preserving modality-specific information. In addition, a perturbation-consistent clean-posterior regularizer improves stability under label-preserving molecular perturbations. To validate the effectiveness of MOLAR, we conduct experiments on naturally noisy molecular benchmarks [buterez2023mf] and controlled label-flipping benchmarks [wu2018moleculenet], showing that MOLAR achieves state-of-the-art performance over representative graph-only, multimodal, and noisy-label learning baselines while providing interpretable posterior reliability and modality-specific molecular evidence.

Our contributions are summarized as follows: \(1\) We formulate noisy\-label multimodal molecular representation learning around three challenges: separating latent molecular properties from recorded labels, using recorded labels as supervision through a label\-observation channel, and interpreting graph–text disagreement under unreliable supervision\. \(2\) We propose MOLAR, a noise\-aware framework that composes graph and text views as residual natural\-parameter evidence for clean\-property prediction and connects this prediction to recorded labels through a categorical label\-observation channel\. \(3\) We conduct experiments on naturally noisy molecular benchmarks and controlled label\-flipping benchmarks, demonstrating state\-of\-the\-art performance over representative graph\-based, multimodal, and noisy\-label learning baselines, together with interpretable posterior reliability and modality\-specific molecular evidence\.

## 2 Materials and methods

### 2.1 Preliminary

Given a molecule represented by a molecular graph $G=(V,E,X)$, where $V$ is the set of atoms, $E$ is the set of chemical bonds, and $X$ is the atom-feature matrix, together with a text-derived molecular view $T$, we study noisy-label multimodal molecular property prediction. The recorded label $\tilde{y}\in\mathcal{Y}$ may be affected by experimental variation, database curation, or weak annotation, and is therefore treated as a noisy observation of the latent clean molecular property label $y\in\mathcal{Y}$. Given a noisy multimodal training set $\mathcal{D}=\{(G_i,T_i,\tilde{y}_i)\}_{i=1}^{N}$, where $G_i$ and $T_i$ are the graph and text views of molecule $i$, our goal is to learn a clean molecular property posterior

$$\mathbf{p}_i=p_\theta\left(y_i\mid G_i,T_i\right)\in\Delta^{C-1}. \tag{1}$$

Here, $\mathcal{Y}=\{1,\ldots,C\}$ is the categorical label space, $\mathbf{p}_i=(p_{i,1},\ldots,p_{i,C})$ is the posterior used for inference, and $\Delta^{C-1}$ denotes the probability simplex over $\mathcal{Y}$, i.e., the set of non-negative $C$-dimensional vectors whose entries sum to one.

### 2.2 Overview of MOLAR

![Refer to caption](https://arxiv.org/html/2606.18390v1/x2.png)

Figure 2: Framework overview of MOLAR. Graph and text views are composed into a latent clean-property posterior, which is connected to recorded noisy labels through a categorical label-observation channel. The framework also derives posterior reliability and regularizes evidence conflict and perturbation consistency.

As shown in Figure [2](https://arxiv.org/html/2606.18390#S2.F2), MOLAR is a noise-aware framework for learning multimodal molecular representations from noisy labels. The framework contains four modules. The molecular evidence initialization module encodes the molecular graph $G_i$ and the text-derived view $T_i$ into modality-specific representations. The residual natural-parameter evidence composition module maps these representations to graph- and text-derived evidence and combines them into a latent clean categorical distribution. The categorical label-observation channel links the clean posterior to the recorded-label distribution, allowing noisy labels to provide supervision without being treated as clean targets. The noise-aware learning objective combines recorded-label likelihood with clean-evidence regularization to reduce contradictory graph–text evidence and improve perturbation-consistent clean posterior prediction.

### 2.3 Molecular evidence initialization

Given the graph and text-derived views of molecule $i$, MOLAR first encodes them into modality-specific molecular representations. The graph view $G_i$ is processed by a graph encoder $f_g$, which can be instantiated by common message-passing architectures such as GCN, GAT, or GIN. The text-derived view $T_i$ is encoded by a text-side encoder $f_t$, which can be implemented using a pretrained molecular or biomedical language model, or a lightweight encoder over precomputed molecular text embeddings. The two representations are defined as

$$\mathbf{z}_i^g=f_g(G_i)\in\mathbb{R}^d,\qquad\mathbf{z}_i^t=f_t(T_i)\in\mathbb{R}^d. \tag{2}$$

The graph representation $\mathbf{z}_i^g$ summarizes structural information from the atom–bond topology, including local chemical neighborhoods and global connectivity patterns. The text representation $\mathbf{z}_i^t$ captures complementary semantic, physicochemical, or pharmacological information from molecular descriptions or text-derived embeddings. These two modality-specific representations serve as the initial molecular evidence for clean-property prediction.

### 2.4 Residual natural-parameter evidence composition

A common multimodal strategy is to concatenate graph and text representations, or to produce modality\-specific predictions and combine them with a fusion module\. Such designs can be fragile under noisy supervision: if the recorded label is corrupted, probability\-level fusion or forced graph–text alignment may propagate label\-induced errors across modalities\. They also make it difficult to determine whether a prediction is supported by structural evidence, text\-derived evidence, or both\.

To preserve modality-specific evidence while forming a single clean-property predictor, MOLAR composes graph and text information in the natural-parameter space of a categorical distribution. For a categorical variable with $C$ classes, the logit vector is a natural parameter and is identifiable up to an additive constant. In this space, additive residuals correspond to additive changes in class-relative evidence. We fix the arbitrary offset with the centering operator

$$\mathcal{C}(\mathbf{v})=\mathbf{v}-\frac{1}{C}\left(\mathbf{1}_C^{\top}\mathbf{v}\right)\mathbf{1}_C,\qquad\mathbf{v}\in\mathbb{R}^{C}, \tag{3}$$

where $\mathbf{1}_C$ is the all-one vector. The operator removes the common additive component and keeps only class-relative evidence.

Given $\mathbf{z}_i^g$ and $\mathbf{z}_i^t$, two learnable evidence functions produce class-wise residual natural parameters:

$$\mathbf{u}_i^g=\mathcal{C}\left(\phi_g(\mathbf{z}_i^g)\right),\qquad\mathbf{u}_i^t=\mathcal{C}\left(\phi_t(\mathbf{z}_i^t)\right), \tag{4}$$

where $\phi_g:\mathbb{R}^d\rightarrow\mathbb{R}^C$ and $\phi_t:\mathbb{R}^d\rightarrow\mathbb{R}^C$ are graph and text evidence functions, respectively. The component $u^g_{i,c}$ represents graph-derived residual evidence for class $c$, and $u^t_{i,c}$ represents text-derived residual evidence for the same class. Positive residual evidence increases the relative support for a class, whereas values close to zero indicate weak modality-specific evidence.

We further introduce a base natural-parameter vector $\mathbf{b}\in\mathbb{R}^C$ to represent class-level prior tendency. The clean-property logit is defined by additive evidence composition:

$$\boldsymbol{\ell}_i=\mathcal{C}\left(\mathbf{b}+\mathbf{u}_i^g+\mathbf{u}_i^t\right). \tag{5}$$

The clean-property posterior is obtained by normalizing the clean natural parameters:

$$\mathbf{p}_i=\operatorname{softmax}\left(\boldsymbol{\ell}_i\right),\qquad p_{i,c}=p_\theta\left(y_i=c\mid G_i,T_i\right). \tag{6}$$

Equations ([4](https://arxiv.org/html/2606.18390#S2.E4))–([6](https://arxiv.org/html/2606.18390#S2.E6)) define a single latent clean-property distribution for molecule $i$. Instead of training independent graph, text, and fusion classifiers, the two modalities contribute residual evidence to the same categorical distribution. The base vector $\mathbf{b}$ captures class-level tendency, while $\mathbf{u}_i^g$ and $\mathbf{u}_i^t$ describe molecule-specific evidence from the graph and text views.

This explicit decomposition also supports modality-level interpretation. We quantify the relative contribution of graph-derived evidence by

$$M_i^g=\frac{\left\|\mathbf{u}_i^g\right\|_2}{\left\|\mathbf{u}_i^g\right\|_2+\left\|\mathbf{u}_i^t\right\|_2+\varepsilon},\qquad M_i^t=1-M_i^g, \tag{7}$$

where $\varepsilon>0$ is a small constant for numerical stability. A larger $M_i^g$ indicates stronger graph-derived evidence, whereas a larger $M_i^t$ indicates stronger text-derived evidence. These quantities are used for interpretation and diagnostic analysis, not as an additional fusion mechanism.
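The composition in Equations (3)–(7) can be illustrated with a minimal sketch in plain Python. It assumes the outputs of the evidence functions are already available as lists of floats: the arguments `raw_g` and `raw_t` stand in for $\phi_g(\mathbf{z}_i^g)$ and $\phi_t(\mathbf{z}_i^t)$, and the function names and the two-class numbers are hypothetical, chosen only for illustration.

```python
import math

def center(v):
    """Centering operator C(v): subtract the mean so the entries sum to zero (Eq. 3)."""
    mean = sum(v) / len(v)
    return [x - mean for x in v]

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))

def compose_evidence(b, raw_g, raw_t, eps=1e-8):
    """Combine base, graph, and text evidence into a clean posterior (Eqs. 4-7).

    raw_g and raw_t are placeholders for phi_g(z_g) and phi_t(z_t); the
    encoders and evidence heads are outside the scope of this sketch.
    """
    u_g = center(raw_g)                                   # Eq. 4, graph residual
    u_t = center(raw_t)                                   # Eq. 4, text residual
    logits = center([bi + g + t for bi, g, t in zip(b, u_g, u_t)])  # Eq. 5
    p = softmax(logits)                                   # Eq. 6
    n_g, n_t = l2_norm(u_g), l2_norm(u_t)
    m_g = n_g / (n_g + n_t + eps)                         # Eq. 7
    return p, u_g, u_t, m_g, 1.0 - m_g

# Hypothetical two-class molecule: strong graph evidence, weak text evidence.
p, u_g, u_t, m_g, m_t = compose_evidence(
    b=[0.0, 0.0], raw_g=[2.0, 0.0], raw_t=[0.2, 0.0]
)
```

In this toy case both modalities favour the first class, and the graph share $M_i^g$ is close to $10/11$ because the graph residual has ten times the norm of the text residual.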

### 2.5 Categorical label-observation channel

The clean posterior $\mathbf{p}_i$ represents the predicted latent molecular property. However, the label available for training is the recorded label $\tilde{y}_i$, which may be affected by measurement variability, thresholding, database curation, or weak annotation. Directly optimizing the clean posterior against $\tilde{y}_i$ would implicitly treat the recorded label as the clean target. To avoid this assumption, MOLAR introduces a categorical label-observation channel that links latent clean classes to recorded labels.

Let $\mathbf{O}\in[0,1]^{C\times C}$ denote the label-observation matrix:

$$O_{ac}=p_\theta\left(\tilde{y}_i=a\mid y_i=c\right),\qquad a,c\in\mathcal{Y}. \tag{8}$$

Here, $c$ indexes the latent clean class and $a$ indexes the recorded class. Each column of $\mathbf{O}$ is a categorical distribution over recorded labels conditioned on a clean class, satisfying $\sum_{a=1}^{C}O_{ac}=1$ for each $c$.

Since clean labels are not observed during training, an unconstrained channel may arbitrarily exchange latent clean classes and recorded labels. We therefore use a diagonal-dominant parameterization to stabilize this mapping while still allowing nonzero off-diagonal label transitions. Specifically, let

$$\eta_{ac}=\begin{cases}0,&a=c,\\ -\zeta_{ac},&a\neq c,\end{cases}\qquad\zeta_{ac}>0, \tag{9}$$

where $\zeta_{ac}$ is a positive offset for the transition from clean class $c$ to recorded class $a$. The observation probabilities are obtained by a column-wise softmax:

$$O_{ac}=\frac{\exp(\eta_{ac})}{\sum_{b=1}^{C}\exp(\eta_{bc})}. \tag{10}$$

This parameterization assigns the largest logit in each column to the diagonal entry while retaining nonzero probabilities for class confusions. The channel is therefore a constrained probabilistic link between latent molecular properties and recorded labels, rather than an exact recovery of the underlying experimental or curation error process.
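The diagonal-dominant parameterization of Equations (9)–(10) can be sketched as follows. The function name `observation_matrix` and the example offsets are hypothetical, and how the positive offsets $\zeta_{ac}$ are kept positive during learning is not specified here, so the sketch simply takes them as given.

```python
import math

def observation_matrix(zeta):
    """Build O from positive offsets zeta[a][c] (Eqs. 9-10).

    O[a][c] models p(recorded = a | clean = c). Diagonal logits are 0 and
    off-diagonal logits are -zeta[a][c], so every column peaks on its
    diagonal entry. Diagonal entries of zeta are ignored.
    """
    C = len(zeta)
    eta = [[0.0 if a == c else -zeta[a][c] for c in range(C)] for a in range(C)]
    O = [[0.0] * C for _ in range(C)]
    for c in range(C):
        # Column-wise softmax: normalize over recorded classes a for fixed clean class c.
        col = [math.exp(eta[a][c]) for a in range(C)]
        total = sum(col)
        for a in range(C):
            O[a][c] = col[a] / total
    return O

# Hypothetical binary example: zeta[1][0] = 1.0 and zeta[0][1] = 2.0.
zeta = [[0.0, 2.0],
        [1.0, 0.0]]
O = observation_matrix(zeta)
```

A larger offset $\zeta_{ac}$ makes the corresponding confusion less likely; in this example, clean class 1 is recorded as class 0 less often than clean class 0 is recorded as class 1.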

Given the clean-property posterior $\mathbf{p}_i$, the distribution over recorded labels is obtained by marginalizing the latent clean class through the label-observation channel:

$$\tilde{\mathbf{p}}_i=\left(p_\theta\left(\tilde{y}_i=a\mid G_i,T_i\right)\right)_{a=1}^{C}=\mathbf{O}\mathbf{p}_i. \tag{11}$$

Equivalently, for each recorded class $a\in\mathcal{Y}$,

$$\tilde{p}_{i,a}=\sum_{c=1}^{C}O_{ac}\,p_{i,c}. \tag{12}$$

Thus, $\mathbf{p}_i$ is used for clean-property inference, whereas $\tilde{\mathbf{p}}_i$ is used to evaluate the likelihood of the recorded label.

The same channel also provides a posterior reliability score for the recorded label. Let $a_i=\tilde{y}_i$ denote the recorded class of molecule $i$. We define

$$\rho_i=p_\theta\left(y_i=a_i\mid\tilde{y}_i=a_i,G_i,T_i\right). \tag{13}$$

By Bayes' rule,

$$\rho_i=\frac{O_{a_i a_i}\,p_{i,a_i}}{\sum_{c=1}^{C}O_{a_i c}\,p_{i,c}}=\frac{O_{a_i a_i}\,p_{i,a_i}}{\tilde{p}_{i,a_i}}. \tag{14}$$

Here, $\rho_i$ is the posterior probability that the recorded label coincides with the latent clean class under the learned clean posterior and label-observation channel. Thus, label reliability is obtained as a channel-derived posterior quantity rather than as an externally designed sample weight.
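As a small worked illustration of Equations (11)–(14), the sketch below computes the recorded-label distribution and the reliability score for a binary task. The channel values and the clean posterior are hypothetical numbers, not taken from the experiments, and the function names are introduced only for this example.

```python
def recorded_distribution(O, p):
    """p_tilde = O p (Eqs. 11-12): marginalize over the latent clean class."""
    C = len(p)
    return [sum(O[a][c] * p[c] for c in range(C)) for a in range(C)]

def reliability(O, p, recorded):
    """rho = O[a][a] * p[a] / p_tilde[a] for the recorded class a (Eq. 14)."""
    p_tilde = recorded_distribution(O, p)
    return O[recorded][recorded] * p[recorded] / p_tilde[recorded]

# Hypothetical channel with O[a][c] = p(recorded = a | clean = c); columns sum to one.
O = [[0.9, 0.2],
     [0.1, 0.8]]
p = [0.3, 0.7]  # hypothetical clean posterior that favours class 1

p_tilde = recorded_distribution(O, p)    # approximately [0.41, 0.59]
rho_0 = reliability(O, p, recorded=0)    # 0.27 / 0.41, about 0.66
rho_1 = reliability(O, p, recorded=1)    # 0.56 / 0.59, about 0.95
```

With these numbers, a recorded label of 1 agrees with the clean posterior and receives a high reliability, while a recorded label of 0 is judged less reliable because the clean posterior mostly supports class 1.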

### 2.6 Noise-aware learning objective

The learning objective follows the separation between latent clean\-property inference and noisy label observation\. The recorded\-label likelihood is evaluated after the clean posterior passes through the categorical label\-observation channel\. Two regularization terms are imposed on clean\-property evidence: one controls contradictory graph–text evidence, and the other encourages perturbation\-consistent clean posterior predictions\.

#### 2.6.1 Recorded-label likelihood

For molecule $i$, the clean posterior $\mathbf{p}_i$ is mapped to the recorded-label distribution $\tilde{\mathbf{p}}_i=\mathbf{O}\mathbf{p}_i$. The supervised likelihood is defined as the categorical negative log-likelihood of the recorded label:

$$\mathcal{L}_{\mathrm{sup}}=-\frac{1}{N}\sum_{i=1}^{N}\log\left(\tilde{p}_{i,\tilde{y}_i}+\varepsilon\right). \tag{15}$$

This objective uses recorded labels through the label-observation channel, rather than directly treating them as clean targets.
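Equation (15) can be sketched in a few lines of plain Python. The helper name `supervised_nll`, the channel values, and the batch of clean posteriors are hypothetical; the identity channel is included only to show what the loss reduces to when the channel is removed.

```python
import math

def supervised_nll(p_clean, recorded_labels, O, eps=1e-8):
    """Mean negative log-likelihood of recorded labels under p_tilde = O p (Eq. 15)."""
    total = 0.0
    for p, a in zip(p_clean, recorded_labels):
        # Only the entry of p_tilde at the recorded class a is needed.
        p_tilde_a = sum(O[a][c] * p[c] for c in range(len(p)))
        total -= math.log(p_tilde_a + eps)
    return total / len(p_clean)

# Hypothetical channel and batch of two molecules, both recorded as class 0.
O = [[0.9, 0.2],
     [0.1, 0.8]]
identity = [[1.0, 0.0],
            [0.0, 1.0]]
p_clean = [[0.3, 0.7], [0.9, 0.1]]
recorded = [0, 0]

loss_channel = supervised_nll(p_clean, recorded, O)
loss_direct = supervised_nll(p_clean, recorded, identity)
```

With the identity channel, the loss is the ordinary cross-entropy between the clean posterior and the recorded labels. With the non-trivial channel, the first molecule (whose clean posterior disagrees with its recorded label) is penalized less, because part of the mismatch is attributed to the observation process.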

#### 2.6.2 Posterior-weighted evidence-conflict regularization

To reduce harmful cross-modal contradiction, we regularize the residual evidence from graph and text. We first define bounded evidence vectors

$$\mathbf{s}_i^g=\tanh\left(\mathbf{u}_i^g\right),\qquad\mathbf{s}_i^t=\tanh\left(\mathbf{u}_i^t\right), \tag{16}$$

where the operation is applied element-wise. Their inner product measures whether the two modalities support compatible class directions. The evidence-conflict score is

$$c_i=\frac{1}{C}\left[-\left\langle\mathbf{s}_i^g,\mathbf{s}_i^t\right\rangle\right]_{+},\qquad[x]_{+}=\max(x,0). \tag{17}$$

The conflict regularizer is weighted by the posterior reliability $\rho_i$:

$$\mathcal{L}_{\mathrm{conf}}=\frac{1}{N}\sum_{i=1}^{N}\rho_i c_i. \tag{18}$$

Thus, contradictory high-magnitude evidence is penalized more strongly when the recorded label is likely to match the latent clean class, while weak or compatible modality evidence contributes little to the penalty.
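The behaviour of the conflict score in Equations (16)–(18) is easiest to see on small examples. The sketch below uses hypothetical two-class residual vectors and reliability values; whether gradients flow through $\rho_i$ during training is not stated in the text, and the sketch does not address it.

```python
import math

def conflict_score(u_g, u_t):
    """c = (1/C) * max(-<tanh(u_g), tanh(u_t)>, 0) (Eqs. 16-17)."""
    s_g = [math.tanh(x) for x in u_g]
    s_t = [math.tanh(x) for x in u_t]
    inner = sum(a * b for a, b in zip(s_g, s_t))
    return max(-inner, 0.0) / len(u_g)

def conflict_loss(u_g_batch, u_t_batch, rho_batch):
    """Reliability-weighted mean of the conflict scores (Eq. 18)."""
    scores = [conflict_score(g, t) for g, t in zip(u_g_batch, u_t_batch)]
    return sum(r * c for r, c in zip(rho_batch, scores)) / len(scores)

# Hypothetical residual evidence for a two-class task.
agree = conflict_score([2.0, -2.0], [1.0, -1.0])     # same direction: no penalty
clash = conflict_score([2.0, -2.0], [-1.0, 1.0])     # strong opposite evidence
weak = conflict_score([2.0, -2.0], [-0.01, 0.01])    # opposite but very weak text evidence

loss = conflict_loss(
    [[2.0, -2.0], [2.0, -2.0]],
    [[-1.0, 1.0], [-1.0, 1.0]],
    rho_batch=[0.9, 0.1],
)
```

Compatible evidence gives a score of zero, strongly opposed evidence gives the largest score, and opposed but near-zero evidence contributes very little, which matches the description of the regularizer above.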

#### 2.6.3 Perturbation-consistent clean posterior regularization

We further encourage the clean posterior to remain stable under label-preserving molecular perturbations. Let $\boldsymbol{\ell}_i^{(a)}$ and $\boldsymbol{\ell}_i^{(b)}$ be the clean logits obtained from two perturbed views of the same molecule. We define the centered posterior score as

$$\mathbf{q}\left(\boldsymbol{\ell}\right)=\operatorname{softmax}\left(\boldsymbol{\ell}\right)-\frac{1}{C}\mathbf{1}_C. \tag{19}$$

The perturbation-consistency regularizer is

$$\mathcal{L}_{\mathrm{pert}}=\frac{1}{N}\sum_{i=1}^{N}\frac{1}{C}\left\|\mathbf{q}\left(\boldsymbol{\ell}_i^{(a)}\right)-\mathbf{q}\left(\boldsymbol{\ell}_i^{(b)}\right)\right\|_2^2. \tag{20}$$

Since the centered score is zero for a uniform categorical posterior and bounded across classes, this term stabilizes clean-property predictions without enforcing over-confident outputs.
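A minimal sketch of Equations (19)–(20) is given below. The perturbations themselves are not modelled: the two logit lists simply stand in for the clean logits of two perturbed views, and the function names and numbers are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def centered_score(logits):
    """q(l) = softmax(l) - 1/C (Eq. 19); zero for a uniform posterior."""
    C = len(logits)
    return [p - 1.0 / C for p in softmax(logits)]

def perturbation_loss(logits_a, logits_b):
    """Mean over molecules of (1/C) * ||q(l_a) - q(l_b)||^2 (Eq. 20)."""
    total = 0.0
    for la, lb in zip(logits_a, logits_b):
        qa, qb = centered_score(la), centered_score(lb)
        total += sum((x - y) ** 2 for x, y in zip(qa, qb)) / len(la)
    return total / len(logits_a)

# Hypothetical clean logits from two perturbed views of one molecule.
same = perturbation_loss([[1.0, -1.0]], [[1.0, -1.0]])      # identical views
shifted = perturbation_loss([[1.0, -1.0]], [[6.0, 4.0]])    # logits differ by a constant
flipped = perturbation_loss([[1.0, -1.0]], [[-1.0, 1.0]])   # views disagree on the class
```

Identical views and views whose logits differ only by an additive constant give zero loss, while views that flip the predicted class are penalized. Because the constant $1/C$ cancels in the difference, the value equals the mean squared difference between the two softmax posteriors.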

#### 2.6.4 Unified training objective

The final objective is

$$\mathcal{L}=\mathcal{L}_{\mathrm{sup}}+\beta\mathcal{L}_{\mathrm{conf}}+\gamma\mathcal{L}_{\mathrm{pert}}, \tag{21}$$

where $\beta\geq 0$ and $\gamma\geq 0$ control the strengths of evidence-conflict regularization and perturbation-consistent clean posterior regularization, respectively.

## 3 Experiments

### 3.1 Experimental setting

![Refer to caption](https://arxiv.org/html/2606.18390v1/x3.png)

Figure 3: Dataset characteristics for natural and controlled noisy-label evaluation. a–d, Statistics of the naturally noisy MF-PCBA-Noisy7 benchmark, including the scale difference between single-dose (SD) screening labels and dose-response (DR) confirmatory labels, class imbalance, and SD–DR label disagreement. e–f, Controlled label-flipping protocol on MoleculeNet benchmarks, where training and validation labels are corrupted while test labels remain unchanged.

Datasets. We evaluate MOLAR on two complementary groups of molecular property benchmarks: (1) The first group is MF-PCBA-Noisy7, a naturally noisy molecular activity benchmark constructed from MF-PCBA [buterez2023mf]. MF-PCBA links large-scale primary screening measurements with confirmatory assays, making it suitable for studying how molecular predictors behave when recorded training labels are not fully reliable. In MF-PCBA-Noisy7, single-dose (SD) screening labels are used as noisy training labels, while dose-response (DR) confirmatory labels are used as higher-confidence validation and test labels. This split reflects a realistic experimental scenario: SD assays provide broad but noisy coverage over many compounds, whereas DR assays are smaller but more reliable because activity is confirmed over multiple concentrations. As shown in Figure [3](https://arxiv.org/html/2606.18390#S3.F3)a–d, MF-PCBA-Noisy7 contains seven molecular activity prediction tasks: NSD2, SKN1, CASP6, RAD52, GSK3A, GIV, and UBC13. The SD training sets contain approximately 0.20–0.34 million molecules per task, whereas the DR validation and test sets contain hundreds to about one thousand molecules. The SD labels are extremely imbalanced, with positive rates below 0.4%, while the DR positive rates are substantially higher. The paired SD–DR disagreement rates range from 25.51% to 81.54%.

(2) The second group consists of four standard MoleculeNet benchmarks [wu2018moleculenet]: HIV, BACE, BBBP, and ClinTox. These datasets cover different molecular property prediction scenarios, including antiviral activity, enzyme inhibition, blood–brain barrier penetration, and clinical toxicity. Unlike MF-PCBA-Noisy7, these benchmarks do not provide paired noisy and confirmatory labels. We therefore use them to construct controlled noisy-label settings with known corruption rates. For each dataset, molecules are split into five folds with a 3:1:1 train/validation/test allocation. Symmetric label-flip noise is injected into the training and validation labels, while the test labels remain unchanged. Unless otherwise specified, the label-flipping rate is fixed at 30%. This controlled setting complements MF-PCBA-Noisy7 by evaluating robustness when the corruption process is known and the held-out test labels remain clean. The statistics of the above datasets are summarized in Tables [3](https://arxiv.org/html/2606.18390#S6.T3) and [4](https://arxiv.org/html/2606.18390#S6.T4).
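The controlled protocol described above can be sketched as follows. Several details are assumptions made only for this illustration and are not stated in the text: the folds are formed by random shuffling (rather than, for example, a scaffold split), exactly `round(rate * n)` labels are flipped in each corrupted subset, and the training and validation subsets are corrupted separately. All function names are hypothetical.

```python
import random

def five_fold_split(n, seed=0):
    """Shuffle indices into five folds and allocate them 3:1:1 (train/val/test)."""
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    folds = [idx[k::5] for k in range(5)]
    return folds[0] + folds[1] + folds[2], folds[3], folds[4]

def flip_labels_symmetric(labels, rate, num_classes=2, seed=0):
    """Flip a fixed fraction of labels to a uniformly chosen different class."""
    rng = random.Random(seed)
    n_flip = round(rate * len(labels))
    flip_idx = set(rng.sample(range(len(labels)), n_flip))
    noisy = []
    for i, y in enumerate(labels):
        if i in flip_idx:
            noisy.append(rng.choice([c for c in range(num_classes) if c != y]))
        else:
            noisy.append(y)
    return noisy

# Hypothetical binary dataset with 100 molecules.
labels = [i % 2 for i in range(100)]
train_idx, val_idx, test_idx = five_fold_split(len(labels), seed=0)

observed = list(labels)  # labels the model would see; test entries stay clean
for part_seed, part in enumerate((train_idx, val_idx)):
    flipped = flip_labels_symmetric([labels[i] for i in part], rate=0.3, seed=part_seed)
    for i, y in zip(part, flipped):
        observed[i] = y
```

After this procedure, `observed` carries 30% flipped labels on the training and validation indices, while the labels at `test_idx` are identical to the originals.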

Baselines. We compare MOLAR with representative baselines from four categories. The first category includes supervised graph neural networks, namely GCN [GCN], GAT [GAT], and GIN [gin]. The second category includes molecular representation learning and pretraining methods, including GraphCL [GraphCL], GROVE [GROVE], SmiSGT [SmiSGT], Uni-Mol [uni-mol], and S-GCIB [s-gcib]. The third category includes multimodal molecular learning methods, including Tri-SGD [Tri-SGD], MMSG [MMSG], MDFCL [mdfcl], and ProtoMol [protomol]. The fourth category includes robust graph or noisy-label learning methods, including OMG [omg], RTGNN [rtgnn], SPORT [sport], and TFR [TFR]. More details of these baselines are provided in Appendix [7](https://arxiv.org/html/2606.18390#S7).

Evaluation metrics and protocol. The primary metric is the area under the receiver operating characteristic curve (ROC-AUC), denoted as AUC. Given prediction scores $\{s_i\}$ and binary labels $\{y_i\}$, let $\mathcal{P}=\{i: y_i=1\}$ and $\mathcal{N}=\{j: y_j=0\}$ denote the positive and negative sample sets. AUC can be written as

$$\mathrm{AUC}=\frac{1}{|\mathcal{P}||\mathcal{N}|}\sum_{i\in\mathcal{P}}\sum_{j\in\mathcal{N}}\left[\mathbb{I}(s_i>s_j)+\frac{1}{2}\mathbb{I}(s_i=s_j)\right]. \tag{22}$$

AUC is threshold independent and measures whether positive samples are ranked above negative samples. Because several tasks are highly imbalanced, we also report the area under the precision–recall curve (AUPRC) and Matthews correlation coefficient (MCC). For a ranked prediction list, AUPRC is computed as

$$\mathrm{AUPRC}=\sum_{r=1}^{R}\left(\mathrm{Recall}_r-\mathrm{Recall}_{r-1}\right)\mathrm{Precision}_r, \tag{23}$$

where $r$ indexes operating points along the precision–recall curve. MCC is defined as

$$\mathrm{MCC}=\frac{\mathrm{TP}\cdot\mathrm{TN}-\mathrm{FP}\cdot\mathrm{FN}}{\sqrt{(\mathrm{TP}+\mathrm{FP})(\mathrm{TP}+\mathrm{FN})(\mathrm{TN}+\mathrm{FP})(\mathrm{TN}+\mathrm{FN})}}, \tag{24}$$

where TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives, respectively.
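The three metrics in Eqs. (22)–(24) can be written directly from their definitions. The following is a small pure-Python sketch rather than the evaluation code used in the experiments. It assumes one operating point per ranked item for AUPRC (tied scores are not grouped), at least one positive and one negative sample, and the common convention of returning 0 for MCC when the denominator vanishes; the function names are illustrative.

```python
import math

def roc_auc(scores, labels):
    """Eq. (22): fraction of positive/negative pairs ranked correctly,
    counting tied scores as half a correct pair."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    total = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                total += 1.0
            elif sp == sn:
                total += 0.5
    return total / (len(pos) * len(neg))

def auprc(scores, labels):
    """Eq. (23): sum of precision weighted by the recall increment,
    with one operating point per item in descending score order."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    n_pos = sum(labels)
    tp, area, prev_recall = 0, 0.0, 0.0
    for rank, i in enumerate(order, start=1):
        tp += labels[i]
        recall = tp / n_pos
        precision = tp / rank
        area += (recall - prev_recall) * precision
        prev_recall = recall
    return area

def mcc(preds, labels):
    """Eq. (24), computed from hard 0/1 predictions."""
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    tn = sum(p == 0 and y == 0 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 0, 1, 1, 0, 0]
preds = [int(s >= 0.5) for s in scores]  # threshold chosen only for this example

auc_value = roc_auc(scores, labels)    # 7 of 9 pairs are ordered correctly
auprc_value = auprc(scores, labels)    # 1/3 + 2/9 + 1/4
mcc_value = mcc(preds, labels)         # TP=3, FP=1, TN=2, FN=0
```

AUC and AUPRC depend only on the ranking of scores, whereas MCC requires a decision threshold; the 0.5 threshold here is an assumption made for the example.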

Implementation details. All methods are implemented in PyTorch (https://pytorch.org/). Molecular graphs are constructed from SMILES using RDKit (https://www.rdkit.org/) with standard atom and bond features, and text-derived views are generated from rule-based molecular descriptions, encoded by BiomedBERT, and cached before training. No atom or bond deletion is applied in the default perturbation pipeline, so the molecular topology is preserved. For MOLAR, the graph branch uses a GIN encoder [gin], and the text branch uses a lightweight multilayer encoder. The hidden dimension is 256, dropout is 0.2, the learning rate is $3\times 10^{-4}$, the weight decay is $10^{-5}$, and the batch size is 512. Models are optimized with Adam and selected by validation performance. The label-observation channel is initialized close to the identity mapping, and the perturbation-consistency term uses label-preserving stochastic views, including atom-feature masking and text-view dropout. The regularization weights $\beta$ and $\gamma$, together with channel-related hyperparameters, are selected from small validation grids. For MF-PCBA-Noisy7, models are trained on noisy SD labels, selected on the DR validation split, and evaluated on the held-out DR test split. All experiments are repeated with five random seeds under the same settings.
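To make the near-identity initialization concrete, the sketch below builds a row-stochastic observation matrix $\mathbf{O}$ whose diagonal carries most of the mass and applies it to a clean-property distribution. The row-softmax parameterization, the diagonal value of 0.9, and the function names are assumptions for illustration; this section does not specify how the channel is parameterized.

```python
import math

def init_channel_logits(num_classes, diag_prob=0.9):
    """Logits whose row-wise softmax puts `diag_prob` on the diagonal and
    spreads the remainder evenly over the other classes (assumed scheme)."""
    off = (1.0 - diag_prob) / (num_classes - 1)
    return [[math.log(diag_prob if r == c else off) for c in range(num_classes)]
            for r in range(num_classes)]

def row_softmax(logits):
    """Convert each row of logits into a probability distribution."""
    rows = []
    for row in logits:
        shift = max(row)  # subtract the max for numerical stability
        exps = [math.exp(v - shift) for v in row]
        total = sum(exps)
        rows.append([e / total for e in exps])
    return rows

def observe(clean_probs, channel):
    """Recorded-label distribution: p_tilde[k] = sum_c p[c] * O[c][k]."""
    num_classes = len(channel)
    return [sum(clean_probs[c] * channel[c][k] for c in range(num_classes))
            for k in range(num_classes)]

channel = row_softmax(init_channel_logits(num_classes=2, diag_prob=0.9))
clean_probs = [0.8, 0.2]                         # clean-property distribution p_i
recorded_probs = observe(clean_probs, channel)   # approximately [0.74, 0.26]
```

With a diagonal value close to 1 the channel starts near the identity, so the recorded-label distribution initially tracks the clean-property distribution and departs from it only as the off-diagonal entries are learned.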

### 3.2 Performance comparison

Table 1: Classification results (ROC-AUC ± Std, %) on MF-PCBA-Noisy7 (NSD2 through UBC13) and MoleculeNet (HIV through ClinTox) benchmarks.

| Method | NSD2 | SKN1 | CASP6 | RAD52 | GSK3A | GIV | UBC13 | HIV | BACE | BBBP | ClinTox |
|---|---|---|---|---|---|---|---|---|---|---|---|
| GCN | 76.1 ± 0.8 | 72.5 ± 1.1 | 67.9 ± 1.4 | 63.2 ± 1.2 | 56.4 ± 0.9 | 45.3 ± 1.3 | 43.1 ± 1.0 | 59.1 ± 9.6 | 66.5 ± 5.6 | 71.5 ± 6.1 | 56.7 ± 7.2 |
| GAT | 74.6 ± 1.2 | 65.2 ± 0.9 | 68.1 ± 1.5 | 62.4 ± 0.7 | 57.8 ± 1.1 | 42.2 ± 1.4 | 41.6 ± 0.8 | 55.3 ± 7.0 | 52.6 ± 3.4 | 64.6 ± 3.9 | 50.7 ± 4.9 |
| GIN | 71.9 ± 0.9 | 70.9 ± 1.3 | 66.3 ± 1.0 | 64.5 ± 1.4 | 55.5 ± 0.8 | 46.8 ± 1.1 | 44.5 ± 1.2 | 62.2 ± 3.2 | 59.1 ± 10.8 | 78.3 ± 3.2 | 55.4 ± 8.0 |
| GraphCL | 77.9 ± 1.2 | 76.7 ± 0.9 | 73.6 ± 1.1 | 70.9 ± 0.8 | 62.7 ± 1.4 | 57.4 ± 1.3 | 50.1 ± 1.0 | 61.6 ± 4.3 | 57.9 ± 7.0 | 75.5 ± 2.5 | 57.4 ± 10.7 |
| GROVE | 82.4 ± 0.7 | 74.9 ± 1.4 | 73.7 ± 1.2 | 72.0 ± 0.9 | 65.1 ± 1.5 | 51.1 ± 1.1 | 50.4 ± 0.8 | 62.9 ± 2.3 | 66.4 ± 6.4 | 80.2 ± 2.4 | 56.3 ± 9.5 |
| SmiSGT | 80.4 ± 1.3 | 77.1 ± 0.8 | 74.5 ± 1.0 | 65.6 ± 1.4 | 63.3 ± 0.9 | 54.2 ± 1.2 | 50.1 ± 0.7 | 58.9 ± 5.8 | 62.9 ± 7.1 | 77.2 ± 2.2 | 54.1 ± 8.8 |
| Uni-Mol | 78.0 ± 1.1 | 75.6 ± 1.5 | 75.5 ± 0.8 | 67.7 ± 1.3 | 64.6 ± 1.0 | 46.3 ± 0.9 | 51.6 ± 1.4 | 56.4 ± 6.7 | 58.1 ± 4.7 | 73.7 ± 1.5 | 53.6 ± 4.2 |
| S-GCIB | 84.2 ± 0.9 | 76.5 ± 1.2 | 74.1 ± 0.7 | 72.7 ± 1.1 | 62.2 ± 1.3 | 48.8 ± 0.8 | 50.3 ± 1.0 | 60.0 ± 3.2 | 66.4 ± 7.5 | 80.6 ± 1.6 | 60.6 ± 9.2 |
| Tri-SGD | 76.6 ± 1.4 | 73.1 ± 1.0 | 73.0 ± 0.8 | 59.9 ± 1.2 | 64.4 ± 1.5 | 44.4 ± 0.9 | 56.8 ± 1.3 | 62.5 ± 3.2 | 63.2 ± 4.8 | 78.7 ± 2.8 | 53.0 ± 3.7 |
| MMSG | 75.8 ± 0.7 | 68.1 ± 1.2 | 72.6 ± 1.3 | 61.1 ± 0.9 | 63.6 ± 1.4 | 51.8 ± 1.1 | 49.2 ± 0.8 | 61.3 ± 3.7 | 61.1 ± 4.0 | 83.1 ± 4.9 | 74.2 ± 5.8 |
| MDFCL | 74.9 ± 1.1 | 69.1 ± 0.8 | 71.5 ± 1.0 | 66.9 ± 1.5 | 59.4 ± 0.7 | 54.0 ± 1.3 | 51.5 ± 1.4 | 66.6 ± 3.7 | 68.4 ± 8.6 | 85.3 ± 4.6 | 79.2 ± 10.8 |
| ProtoMol | 77.3 ± 1.2 | 73.8 ± 1.1 | 73.7 ± 0.9 | 60.7 ± 1.0 | 65.1 ± 1.2 | 45.2 ± 0.8 | 57.6 ± 1.1 | 63.4 ± 2.9 | 64.2 ± 4.2 | 84.5 ± 2.7 | 74.3 ± 3.4 |
| OMG | 82.1 ± 0.9 | 73.4 ± 1.4 | 65.8 ± 1.1 | 65.2 ± 0.8 | 61.6 ± 0.3 | 54.4 ± 0.6 | 52.6 ± 1.0 | 59.6 ± 5.7 | 63.4 ± 5.0 | 77.2 ± 4.2 | 58.0 ± 7.6 |
| SPORT | 75.9 ± 1.3 | 77.5 ± 0.7 | 70.1 ± 1.2 | 60.3 ± 1.4 | 57.3 ± 0.9 | 53.8 ± 0.8 | 45.4 ± 1.1 | 57.6 ± 4.5 | 64.4 ± 2.1 | 73.9 ± 3.8 | 52.9 ± 6.8 |
| RTGNN | 60.9 ± 1.5 | 54.1 ± 0.8 | 45.7 ± 1.4 | 66.5 ± 1.1 | 46.9 ± 1.2 | 53.5 ± 0.9 | 49.2 ± 0.8 | 61.4 ± 4.0 | 69.6 ± 4.2 | 76.5 ± 6.8 | 52.7 ± 8.4 |
| TFR | 77.0 ± 2.5 | 70.5 ± 7.1 | 68.5 ± 2.9 | 70.7 ± 1.3 | 63.1 ± 1.4 | 55.7 ± 2.9 | 55.7 ± 3.0 | 62.5 ± 3.0 | 65.1 ± 10.7 | 75.0 ± 3.7 | 55.9 ± 7.5 |
| MOLAR | **84.6** ± 0.7 | **78.9** ± 0.6 | **76.2** ± 0.4 | **74.8** ± 1.1 | **66.5** ± 0.6 | **60.2** ± 1.0 | **57.9** ± 0.6 | **67.7** ± 3.0 | **74.0** ± 1.7 | **89.9** ± 2.2 | **82.2** ± 7.8 |

Table [1](https://arxiv.org/html/2606.18390#S3.T1) and Figures [4](https://arxiv.org/html/2606.18390#S3.F4) and [5](https://arxiv.org/html/2606.18390#S3.F5) report the performance comparisons on MF-PCBA-Noisy7 and MoleculeNet benchmarks. Overall, MOLAR achieves the highest mean ROC-AUC on all eleven tasks, with consistent improvements over graph-based predictors, multimodal representation learning methods, and noisy-label learning baselines. On MF-PCBA-Noisy7, the improvement is more pronounced for challenging tasks with severe label inconsistency, whereas the gains are relatively moderate on tasks where existing molecular representation learners are already competitive. On the controlled label-flipping benchmarks, MOLAR also maintains a clear advantage under synthetic corruption. These trends suggest that the proposed framework is particularly beneficial when recorded labels are unreliable: by separating clean-property inference from noisy label observation and by modeling graph–text evidence explicitly, MOLAR reduces the effect of corrupted supervision while preserving the complementary information provided by the two molecular views.

Table 2: Aggregate comparison across all reported tasks. Bold results indicate the best performance.

| Method | Natural Avg. | Control Avg. | Overall Avg. | Avg. Rank | p-value | W/T/L |
|---|---|---|---|---|---|---|
| GCN | 60.64 | 63.45 | 61.66 | 12.18 | $3.51\times10^{-5}$ | 10/1/0 |
| GAT | 58.84 | 55.80 | 57.74 | 15.64 | $2.67\times10^{-5}$ | 11/0/0 |
| GIN | 60.06 | 63.75 | 61.40 | 12.45 | $1.79\times10^{-5}$ | 10/1/0 |
| GraphCL | 67.04 | 63.10 | 65.61 | 7.86 | $3.36\times10^{-3}$ | 10/1/0 |
| GROVE | 67.09 | 66.45 | 66.85 | 5.77 | $6.91\times10^{-3}$ | 8/3/0 |
| SmiSGT | 66.46 | 63.28 | 65.30 | 8.18 | $3.30\times10^{-3}$ | 10/1/0 |
| Uni-Mol | 65.61 | 60.45 | 63.74 | 9.27 | $2.05\times10^{-3}$ | 10/1/0 |
| S-GCIB | 66.97 | 66.90 | 66.95 | 6.32 | $3.22\times10^{-3}$ | 8/3/0 |
| Tri-SGD | 64.03 | 64.35 | 64.15 | 9.68 | $2.60\times10^{-3}$ | 8/3/0 |
| MMSG | 63.17 | 69.93 | 65.63 | 9.95 | $1.03\times10^{-5}$ | 8/3/0 |
| MDFCL | 63.90 | 74.88 | 67.89 | 7.18 | $1.98\times10^{-5}$ | 7/4/0 |
| ProtoMol | 64.77 | 71.60 | 67.25 | 6.73 | $1.01\times10^{-3}$ | 7/4/0 |
| OMG | 65.01 | 64.55 | 64.85 | 8.77 | $4.59\times10^{-4}$ | 10/1/0 |
| SPORT | 62.90 | 62.20 | 62.65 | 11.64 | $4.22\times10^{-4}$ | 11/0/0 |
| RTGNN | 53.83 | 65.05 | 57.91 | 12.23 | $3.16\times10^{-4}$ | 9/2/0 |
| TFR | 65.89 | 64.63 | 65.43 | 8.14 | $2.17\times10^{-3}$ | 8/3/0 |
| MOLAR | 71.30 | 78.45 | 73.90 | 1.00 | – | – |

We further summarize the aggregate comparison in Table [2](https://arxiv.org/html/2606.18390#S3.T2). Natural Avg. denotes the mean ROC-AUC over the seven MF-PCBA-Noisy7 tasks, Control Avg. denotes the mean ROC-AUC over the four MoleculeNet benchmarks, and Overall Avg. is computed across all eleven tasks. Avg. Rank denotes the average task-wise rank, where a smaller value indicates better overall ranking. The $p$-value is computed using a paired task-level $t$-test over the eleven task-wise mean ROC-AUC values. For W/T/L, MOLAR is compared with each baseline on every task using the reported mean ± standard-deviation intervals; non-overlapping intervals in favor of MOLAR are counted as wins, overlapping intervals as ties, and non-overlapping intervals in favor of the baseline as losses. From these results, it can be observed that MOLAR achieves strong and consistent performance across both benchmark groups. These findings suggest that the performance gains are systematic and robust across diverse noisy-label settings, rather than being driven by a small number of favorable tasks.
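The W/T/L rule and the averages can be checked by hand against Table 1. The sketch below applies the interval rule to the MOLAR and S-GCIB rows. Treating intervals that merely touch as overlapping is an assumption, since the text does not state how exact boundary contact is handled, and the helper names are illustrative.

```python
# (mean, std) per task in Table 1 order:
# NSD2, SKN1, CASP6, RAD52, GSK3A, GIV, UBC13, HIV, BACE, BBBP, ClinTox
MOLAR = [(84.6, 0.7), (78.9, 0.6), (76.2, 0.4), (74.8, 1.1), (66.5, 0.6),
         (60.2, 1.0), (57.9, 0.6), (67.7, 3.0), (74.0, 1.7), (89.9, 2.2),
         (82.2, 7.8)]
S_GCIB = [(84.2, 0.9), (76.5, 1.2), (74.1, 0.7), (72.7, 1.1), (62.2, 1.3),
          (48.8, 0.8), (50.3, 1.0), (60.0, 3.2), (66.4, 7.5), (80.6, 1.6),
          (60.6, 9.2)]

def compare_intervals(ours, theirs):
    """Classify one task from the mean +/- std intervals of two methods."""
    (m1, s1), (m2, s2) = ours, theirs
    if m1 - s1 > m2 + s2:   # our interval lies entirely above theirs
        return "win"
    if m2 - s2 > m1 + s1:   # their interval lies entirely above ours
        return "loss"
    return "tie"            # the intervals overlap (or touch, by assumption)

def win_tie_loss(ours_row, baseline_row):
    """Count wins, ties, and losses over all tasks."""
    outcomes = [compare_intervals(a, b) for a, b in zip(ours_row, baseline_row)]
    return outcomes.count("win"), outcomes.count("tie"), outcomes.count("loss")

def mean(values):
    return sum(values) / len(values)

wtl = win_tie_loss(MOLAR, S_GCIB)
natural_avg = mean([m for m, _ in MOLAR[:7]])   # seven MF-PCBA-Noisy7 tasks
control_avg = mean([m for m, _ in MOLAR[7:]])   # four MoleculeNet tasks
overall_avg = mean([m for m, _ in MOLAR])       # all eleven tasks
```

For this pair the rule gives 8 wins, 3 ties (NSD2, RAD52, and BACE), and 0 losses, matching the S-GCIB row of Table 2, and the three averages reproduce the MOLAR row (71.30, 78.45, and 73.90).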

![Refer to caption](https://arxiv.org/html/2606.18390v1/x4.png)

Figure 4: MCC-based comparison on natural-noise and control-noise tasks. (a) Average MCC over natural-noise and control-noise tasks for each method. (b,c) Per-task MCC on natural-noise MF-PCBA tasks and control-noise MoleculeNet tasks, respectively. Points denote mean MCC and horizontal bars denote standard deviation across runs; black diamonds indicate the group-wise mean for each method. The dashed vertical line marks zero MCC, and the shaded row highlights MOLAR. (d) Dataset-wise margin between MOLAR and the strongest baseline, computed as $\mathrm{MCC}_{\mathrm{MOLAR}}-\max_{\mathrm{baseline}}\mathrm{MCC}$.

![Refer to caption](https://arxiv.org/html/2606.18390v1/x5.png)

Figure 5: AUPRC-based comparison on natural-noise and control-noise tasks. (a) Average AUPRC over natural-noise and control-noise tasks for each method. (b,c) Per-task AUPRC on natural-noise MF-PCBA tasks and control-noise MoleculeNet tasks, respectively. Points denote mean AUPRC and horizontal bars denote standard deviation across runs; black diamonds indicate the group-wise mean for each method. The shaded row highlights MOLAR. (d) Dataset-wise margin between MOLAR and the strongest baseline, computed as $\mathrm{AUPRC}_{\mathrm{MOLAR}}-\max_{\mathrm{baseline}}\mathrm{AUPRC}$.

### 3.3 Robustness Analysis

To evaluate robustness under different levels of label corruption, we further conduct controlled experiments with label\-flipping rates ranging from 10% to 70% on MoleculeNet benchmarks\. The results are reported in Figure[6](https://arxiv.org/html/2606.18390#S3.F6)\. Across all settings, MOLAR consistently ranks among the strongest methods and achieves the best overall performance under moderate corruption rates \(10%–40%\)\. As the noise rate increases, the performance of most baselines drops rapidly, whereas MOLAR maintains comparatively stable prediction accuracy on several datasets, especially BBBP and ClinTox\. This trend suggests that separating clean\-property inference from noisy label observation enables the model to remain effective even when recorded labels become increasingly unreliable\.

![Refer to caption](https://arxiv.org/html/2606.18390v1/x6.png)Figure 6:Control\-noise robustness under increasing label\-flip ratios\. \(a–d\) ROC\-AUC heatmaps for HIV, BACE, BBBP, and ClinTox across 10–70% label flipping; cell values denote mean ROC\-AUC and short bars indicate standard deviation\. \(e\) Average ROC\-AUC across the four datasets\. \(f\) Margin between MOLAR and the strongest baseline at each flip ratio\.
### 3.4 Ablation study

![Refer to caption](https://arxiv.org/html/2606.18390v1/x7.png)

Figure 7: Ablation study on natural-noise and controlled-noise datasets. AUC scores are reported for different model variants, with error bars denoting standard deviations across runs. a–d, Results on natural-noise tasks. e–h, Results on controlled-noise tasks.

To evaluate the contribution of each component in MOLAR, we compare the full model with five variants: removing the evidence-conflict regularization term $\mathcal{L}_{\mathrm{conf}}$, removing the perturbation-consistency term $\mathcal{L}_{\mathrm{pert}}$, removing graph-derived evidence, removing text-derived evidence, and replacing the learned label-observation channel with a shared channel. As shown in Figure [7](https://arxiv.org/html/2606.18390#S3.F7), the full model achieves the strongest performance across both natural-noise tasks and controlled-noise tasks. Removing $\mathcal{L}_{\mathrm{conf}}$ or $\mathcal{L}_{\mathrm{pert}}$ consistently weakens performance, indicating that both cross-modal evidence-conflict control and perturbation-consistent clean posterior learning are important for robustness under noisy supervision. The modality ablations further show that graph and text views provide complementary molecular evidence: removing either view leads to clear degradation, but the magnitude varies across tasks, suggesting that different molecular properties rely on different sources of information. The shared-channel variant also performs worse than the full model, supporting the need to explicitly learn a flexible label-observation channel rather than using an overly restricted noisy-label link. Overall, these results confirm that the performance gains of MOLAR arise from the joint design of residual graph–text evidence, clean-posterior regularization, and noise-aware label observation.

### 3.5 Hyperparameter analysis

![Refer to caption](https://arxiv.org/html/2606.18390v1/x8.png)

Figure 8: Hyperparameter sensitivity of MOLAR. a–d, Natural-noise AUC heatmaps under different values of $\beta$, $\gamma$, observation-channel diagonal initialization, and observation-channel learning-rate scale. e–h, Corresponding sensitivity curves on MoleculeNet datasets. Cell colors and line values indicate ROC-AUC.

To examine the sensitivity of MOLAR to its main hyperparameters, we vary the evidence-conflict weight $\beta$, the perturbation-consistency weight $\gamma$, the diagonal initialization of the label-observation channel, and the learning-rate scale for the channel parameters. Figure [8](https://arxiv.org/html/2606.18390#S3.F8) reports the results on both MF-PCBA-Noisy7 and MoleculeNet benchmarks. Overall, MOLAR is stable across a reasonable range of settings. For $\beta$, performance generally improves from very small values to a moderate value, indicating that evidence-conflict regularization is beneficial, while an overly large $\beta$ can hurt performance by over-penalizing useful graph–text disagreement. For $\gamma$, moderate perturbation consistency gives the best results, whereas overly strong consistency tends to reduce performance on several natural-noise tasks. The label-observation channel parameters show a clearer trend: initializing the channel closer to the identity mapping and using a moderate channel learning-rate scale lead to better performance, especially on naturally noisy tasks where the relationship between latent properties and recorded labels is more uncertain. In contrast, the controlled label-flipping benchmarks show smoother sensitivity curves, suggesting that MOLAR does not rely on a narrowly tuned configuration. These results highlight the importance of balancing clean-evidence regularization with a sufficiently flexible label-observation channel.

### 3.6 Reliability diagnostics and molecular evidence visualization

![Refer to caption](https://arxiv.org/html/2606.18390v1/x9.png)

Figure 9: Posterior reliability analysis on paired SD–DR samples. a, Reliability distributions for SD–DR agreement and disagreement. b, SD–DR agreement rate across reliability bins. c,d, ROC and precision–recall curves for detecting SD–DR disagreement. e,f, AUROC and AUPRC summaries.

![Refer to caption](https://arxiv.org/html/2606.18390v1/x10.png)

Figure 10: Molecular attribution examples on MoleculeNet benchmarks. a, Unflipped samples where $y$, $\tilde{y}$, and $\hat{y}$ agree. b, Flipped samples where $\hat{y}$ matches $y$ rather than the corrupted $\tilde{y}$. Highlighted atoms denote model-attributed evidence for the clean prediction.

To assess whether the posterior reliability derived from the label-observation channel reflects recorded-label quality, we analyze paired SD–DR samples in MF-PCBA-Noisy7, where SD labels serve as noisy recorded labels and DR labels provide higher-confidence references. As shown in Figure [9](https://arxiv.org/html/2606.18390#S3.F9)a, SD–DR-consistent samples receive substantially higher posterior reliability than inconsistent samples. When samples are grouped by reliability, the empirical SD–DR agreement rate increases monotonically with the mean reliability, yielding a Spearman correlation of $r_s=0.98$ (Figure [9](https://arxiv.org/html/2606.18390#S3.F9)b). We further use $1-\rho_i$ as an inconsistency score for detecting SD–DR disagreement and compare it with predictive entropy, evidence conflict $c_i$, and graph–text disagreement. The reliability-based score achieves the strongest ROC and precision–recall performance, with AUROC 0.76 and AUPRC 0.79 (Figure [9](https://arxiv.org/html/2606.18390#S3.F9)c–f). These results suggest that $\rho_i$ captures label trustworthiness more effectively than generic uncertainty or cross-modal disagreement alone.
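One way to read the posterior reliability $\rho_i$ is as a Bayes posterior: given the clean-property distribution $\mathbf{p}_i$ and the observation matrix $\mathbf{O}$, it is the probability that the latent clean class equals the recorded class $\tilde{y}_i$. The sketch below follows that reading, $\rho_i = p_i[\tilde{y}_i]\,O[\tilde{y}_i,\tilde{y}_i] / \sum_c p_i[c]\,O[c,\tilde{y}_i]$. The exact formula used by MOLAR is not given in this section, so this form, the example numbers, and the function name are assumptions.

```python
def posterior_reliability(clean_probs, channel, recorded_label):
    """P(clean class == recorded class | recorded class) by Bayes' rule.

    clean_probs[c]  : clean-property probability of class c
    channel[c][k]   : probability of recording class k when the clean class is c
    """
    k = recorded_label
    joint_match = clean_probs[k] * channel[k][k]
    evidence = sum(p * row[k] for p, row in zip(clean_probs, channel))
    return joint_match / evidence

# Hypothetical two-class example with a near-identity channel.
channel = [[0.9, 0.1],
           [0.1, 0.9]]
clean_probs = [0.8, 0.2]  # the model favors class 0

rho_agree = posterior_reliability(clean_probs, channel, recorded_label=0)
rho_conflict = posterior_reliability(clean_probs, channel, recorded_label=1)

# Inconsistency scores in the sense of 1 - rho_i: higher means less trusted.
score_agree = 1.0 - rho_agree
score_conflict = 1.0 - rho_conflict
```

Under this reading, a recorded label that agrees with the model's clean prediction receives reliability 0.72/0.74, while a conflicting recorded label receives 0.18/0.26, so the conflicting label gets the larger inconsistency score.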

To examine whether the clean posterior can resist corrupted supervision at the molecule level, we visualize representative examples under the 30% controlled label-flipping setting. For unflipped samples, the clean label $y$, recorded label $\tilde{y}$, and prediction $\hat{y}$ are consistent, and the model assigns high posterior reliability (Figure [10](https://arxiv.org/html/2606.18390#S3.F10)a). For flipped samples, the recorded label is corrupted, but the selected predictions match the clean label rather than the flipped label (Figure [10](https://arxiv.org/html/2606.18390#S3.F10)b). The highlighted atoms indicate model-attributed regions that support the clean prediction. These case studies provide qualitative evidence that MOLAR can rely on molecular evidence rather than directly memorizing corrupted labels; however, the highlighted substructures should be interpreted as attribution signals rather than experimentally validated chemical mechanisms.

## 4 Conclusion

We presented MOLAR, a noise\-aware framework for learning multimodal molecular representations from noisy labels\. MOLAR separates latent molecular property prediction from noisy label observation by composing graph and text views as residual evidence for a clean\-property posterior and linking it to recorded labels through a categorical label\-observation channel\. Experiments on naturally noisy and controlled label\-flipping benchmarks show strong robustness over representative graph\-based, multimodal, and noisy\-label learning baselines\. The posterior reliability and modality\-specific evidence further provide useful diagnostics for interpreting noisy labels and graph–text contributions\.

## References


## 5 Notation summary

| Symbol | Meaning |
|---|---|
| $G_i=(V_i,E_i,X_i)$ | Molecular graph for molecule $i$ with atoms, bonds, and atom features |
| $T_i$ | Text-derived molecular view, descriptor summary, SMILES-derived text, or cached language-model representation |
| $\tilde{y}_i$, $y_i$ | Recorded noisy class and latent clean molecular property class for molecule $i$ |
| $\mathcal{Y}$, $C$ | Categorical label space and number of classes |
| $\mathbf{z}_i^{g},\mathbf{z}_i^{t}$ | Graph and text molecular representations |
| $\mathbf{u}_i^{g},\mathbf{u}_i^{t}$ | Graph- and text-derived residual natural-parameter evidence |
| $\boldsymbol{\ell}_i$ | Clean categorical logit vector after residual evidence composition |
| $\mathbf{p}_i$ | Clean-property posterior distribution used for inference |
| $\mathbf{O}$ | Categorical label-observation matrix linking clean labels to recorded labels |
| $\tilde{\mathbf{p}}_i$ | Recorded-label distribution after applying the label-observation channel |
| $\rho_i$ | Posterior reliability that the recorded class matches the latent clean class |
| $M_i^{g},M_i^{t}$ | Relative graph and text evidence contributions for diagnostics |
| $c_i$ | Cross-modal evidence-conflict score |

## 6 Datasets

Table 3: Statistics of the MF-PCBA-Noisy7 datasets.

| Task ID | Target | $N$ train/val/test | Pos. rate (SD/DR) | SD-DR disagr. | Avg. atoms | Notes |
|---|---|---|---|---|---|---|
| NSD2 | NSD2 Methyltransferase | 296,386/694/694 | 0.0% / 24.9% | 75.1% | 25 | High noise, Extreme SD imbalance |
| SKN1 | SKN-1 Transcription Factor | 344,227/663/664 | 0.1% / 18.4% | 81.5% | 24 | High noise, Large SD, Extreme SD imbalance |
| CASP6 | Caspase 6 | 260,564/403/403 | 0.4% / 49.4% | 48.3% | 26 | Extreme SD imbalance |
| RAD52 | RAD52 DNA Repair | 265,560/243/243 | 0.2% / 73.9% | 25.5% | 34 | Low noise, DR imbalanced, Extreme SD imbalance |
| GSK3A | GSK-3$\alpha$ Kinase | 298,513/1023/1023 | 0.1% / 34.4% | 29.7% | 24 | Low noise, Large DR, Extreme SD imbalance |
| GIV | GIV-G$\alpha$i PPI | 201,917/284/285 | 0.2% / 37.4% | 43.4% | 26 | Extreme SD imbalance |
| UBC13 | UBC13 Ubiquitin Ligase | 312,975/486/487 | 0.1% / 28.2% | 71.8% | 31 | High noise, Large SD, Extreme SD imbalance |

Table 4: Statistics of the MoleculeNet datasets.

| Dataset | Classes | Graphs | Avg. nodes | Avg. edges |
|---|---|---|---|---|
| HIV | 2 | 41,127 | 25.5 | 27.5 |
| BACE | 2 | 1,513 | 34.1 | 36.9 |
| BBBP | 2 | 2,039 | 23.9 | 25.2 |
| ClinTox | 2 | 1,478 | 26.1 | 27.7 |

## 7 Baselines

We compare MOLAR with representative baselines from four groups: standard graph neural networks, molecular representation learning methods, multimodal molecular learning methods, and robust graph learning methods under noisy labels\. All methods are trained, selected, and evaluated using the same splits and evaluation protocol as MOLAR\. For MF\-PCBA\-Noisy7, models are trained on noisy SD labels and evaluated on DR labels; for MoleculeNet, models are trained with flipped training and validation labels and evaluated on clean test labels\.

GCN. Graph Convolutional Network (GCN) is a standard message-passing GNN based on localized graph convolution [GCN]. It updates atom representations by aggregating normalized neighborhood information and obtains graph-level molecular representations through readout. We include GCN as a basic graph-only baseline.

GAT. Graph Attention Network (GAT) extends message passing with learnable attention weights over neighboring nodes [GAT]. This allows the model to emphasize informative atom neighborhoods instead of uniformly aggregating all neighbors. We include GAT as an attention-based graph-only baseline.

GIN. Graph Isomorphism Network (GIN) is an expressive GNN designed to match the discriminative power of the Weisfeiler–Lehman graph isomorphism test [gin]. It uses sum aggregation and multilayer perceptrons to capture molecular substructures. We use GIN as a strong graph-only baseline and as the graph encoder backbone for MOLAR.

GraphCL. GraphCL learns graph representations by maximizing agreement between differently augmented views of the same graph [GraphCL]. It uses graph augmentations such as node dropping, edge perturbation, attribute masking, and subgraph sampling. We include GraphCL as a self-supervised molecular graph representation baseline.

GROVE. GROVE, also known as GROVER, is a large-scale self-supervised molecular graph Transformer [GROVE]. It pretrains molecular encoders with node-level, edge-level, and graph-level tasks, including contextual property prediction and motif prediction. We include GROVE as a strong pretrained graph-based molecular representation baseline.

SmiSGT. SmiSGT, introduced as SimSGT, is a masked graph modeling method for molecular pretraining [SmiSGT]. It revisits the roles of graph tokenization and decoding, using a Simple GNN-based Tokenizer and remask decoding to improve molecular representation learning. We include it as a substructure-aware self-supervised baseline.

Uni-Mol. Uni-Mol is a universal 3D molecular representation learning framework pretrained on large-scale molecular conformations and protein pocket structures [uni-mol]. It incorporates 3D atomic coordinates through a Transformer-based architecture and supports both property prediction and 3D molecular tasks. We include Uni-Mol as a pretrained 3D molecular baseline.

S-GCIB. S-GCIB is a subgraph-conditioned graph information bottleneck method for molecular graph pretraining [s-gcib]. It learns graph cores and significant subgraphs through graph compression, subgraph extraction, and attention-based interaction. We include S-GCIB as a subgraph-aware molecular pretraining baseline.

Tri-SGD. Tri-SGD is a triple-modal molecular fusion baseline derived from multimodal fused deep learning for drug property prediction [Tri-SGD]. It processes SMILES-encoded vectors, ECFP fingerprints, and molecular graphs with Transformer, BiGRU, and GCN modules, respectively. We include Tri-SGD as a representative late-fusion multimodal baseline.

MMSG. MMSG jointly learns molecular representations from SMILES strings and molecular graphs [MMSG]. It introduces graph-derived bond-level information as an attention bias in the SMILES Transformer and uses a bidirectional message communication GNN to enhance graph representations. We include MMSG as a graph–SMILES multimodal baseline.

MDFCL. MDFCL is a multimodal graph contrastive learning framework for molecular property prediction [mdfcl]. It integrates graph and sequence modalities and designs adaptive molecular augmentations based on backbones and side chains. We include MDFCL as a multimodal contrastive learning baseline.

ProtoMol. ProtoMol is a prototype-guided multimodal molecular learning method [protomol]. It models molecular graphs and textual descriptions with dual-branch encoders, cross-modal attention, and a shared prototype space. We include ProtoMol as a recent graph–text multimodal baseline.

OMG. OMG is a robust graph classification method for learning with noisy graph labels [omg]. It combines graph contrastive learning with coupled Mixup and uses neighbor-aware noise removal to reduce the impact of unreliable labels. We include OMG as a representative noisy-label graph learning baseline.

SPORT. SPORT addresses noisy graph classification from a subgraph perspective [sport]. It represents each graph as a set of perturbed subgraphs, encodes them with an equivariant network, and updates potentially noisy labels using subgraph-level predictions. We include SPORT as a subgraph-based robust learning baseline.

RTGNN. RTGNN is a robust training framework for GNNs under scarce and noisy labels [rtgnn]. It separates clean and noisy label candidates, applies self-reinforcement for label correction, and uses consistency regularization to prevent overfitting. We include RTGNN as a noise-governance graph learning baseline.

TFR. TFR, or Topological Feature Reconstruction, mitigates label noise by using graph topology and feature reconstruction [TFR]. It assumes that clean label patterns share more information with graph structure and node features than corrupted labels. We include TFR as a topology-guided noisy-label graph learning baseline.
