GLACIER: A Multimodal Student-Teacher Foundation Model for Molecular Property Prediction

arXiv cs.LG 06/11/26, 04:00 AM Papers
Summary
This paper introduces GLACIER, a multimodal student-teacher foundation model that integrates molecular graphs, SMILES strings, and physicochemical descriptors to predict molecular properties efficiently. It leverages Finsler geometry-aware fusion and knowledge distillation from larger teacher models (MiniMol, MolFormer) to achieve high performance with a lightweight architecture.
arXiv:2606.11382v1 Announce Type: new Abstract: Deep learning models facilitate the discovery of molecules with tailored properties among billions of candidate compounds. However, the computational burden to develop and deploy state-of-the-art models continuously increases, limiting their scalability. Most large-scale models are unimodal in nature and overlook the potential to leverage complementary molecular data modalities. To address these shortcomings, this paper introduces the Graph-Language Alignment for Chemical Inference and Exploration using Representations (GLACIER) model, a student-teacher framework that integrates molecular graphs, SMILES strings, and physicochemical descriptors to learn rich molecular embeddings. Our framework consists of three stages: (1) we pretrain three student encoders on 100,000 drug-like molecules: a message-passing neural network for molecular graphs, a transformer-based encoder for SMILES strings, and a multilayer perceptron for physicochemical descriptors, (2) we fuse these student modalities using a novel Finsler geometry-aware module, and (3) distill complementary knowledge from large teacher models, including MiniMol and MolFormer, into a single lightweight model via contrastive learning. We demonstrate that GLACIER is a robust framework that delivers high predictive performance and computational efficiency in complex molecular property prediction tasks. Our code is publicly available at https://github.com/eemokey/glacier.
# GLACIER: A Multimodal Student-Teacher Foundation Model for Molecular Property Prediction
Source: [https://arxiv.org/html/2606.11382](https://arxiv.org/html/2606.11382)
Emily Nguyen ([0000-0003-4917-7336](https://orcid.org/0000-0003-4917-7336)), Department of Computer Science, University of Southern California, Los Angeles, California, USA, emilyn98@usc.edu

Yongchan Hong ([0009-0009-8866-1690](https://orcid.org/0009-0009-8866-1690)), Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, California, USA, hongyong@usc.edu

Harsh Toshniwal ([0009-0008-2244-9497](https://orcid.org/0009-0008-2244-9497)), Department of Computer Science, University of Southern California, Los Angeles, California, USA, htoshniw@usc.edu

Yan Liu ([0000-0002-7055-9518](https://orcid.org/0000-0002-7055-9518)), Amazon; Department of Computer Science, University of Southern California, Los Angeles, California, USA, yanliu@cs.usc.edu

Andreas Luttens ([0000-0003-2915-7901](https://orcid.org/0000-0003-2915-7901)), Department of Medical Biochemistry and Biophysics, Science for Life Laboratory, Karolinska Institutet, Stockholm, Sweden, andreas.luttens@ki.se

###### Abstract

Deep learning models facilitate the discovery of molecules with tailored properties among billions of candidate compounds. However, the computational burden to develop and deploy state-of-the-art models continuously increases, limiting their scalability. Most large-scale models are unimodal in nature and overlook the potential to leverage complementary molecular data modalities. To address these shortcomings, this paper introduces the Graph-Language Alignment for Chemical Inference and Exploration using Representations (GLACIER) model, a student-teacher framework that integrates molecular graphs, SMILES strings, and physicochemical descriptors to learn rich molecular embeddings. Our framework consists of three stages: (1) we pretrain three student encoders on 100,000 drug-like molecules: a message-passing neural network for molecular graphs, a transformer-based encoder for SMILES strings, and a multilayer perceptron for physicochemical descriptors, (2) we fuse these student modalities using a novel Finsler geometry-aware module, and (3) distill complementary knowledge from large teacher models, including MiniMol and MolFormer, into a single lightweight model via contrastive learning. We demonstrate that GLACIER is a robust framework that delivers high predictive performance and computational efficiency in complex molecular property prediction tasks. Our code is publicly available at [https://github.com/eemokey/glacier](https://github.com/eemokey/glacier).

Molecular Property Prediction, Multimodal Learning, Foundation Model, Contrastive Learning, Knowledge Distillation, Finsler Geometry, Molecular Representation Learning, Drug Discovery

CCS Concepts: Computing methodologies → Machine learning; Applied computing → Chemistry

![Scatter plot with points for each model for efficiency vs AUROC.](https://arxiv.org/html/2606.11382v1/fig1.png)

Figure 1. Model performance (AUROC) versus model parameter count (left) and model inference time per molecule (right).

![Illustration of the GLACIER model architecture. Three student encoders using complementary molecular representations are geometrically fused. The resulting foundation model can be trained for downstream property prediction tasks relevant for drug discovery.](https://arxiv.org/html/2606.11382v1/fig2.png)

Figure 2. Overview of the GLACIER framework: In Step 1, GLACIER is instantiated as a multimodal foundation model pretrained using 100,000 molecules sampled from the Enamine REAL database. The architecture processes each molecule across three modalities (molecular graphs, SMILES strings, and physicochemical descriptors) to capture a comprehensive molecular representation. In Step 2, the disparate modality representations obtained from Step 1 are integrated using a novel Finsler geometry-aware fusion mechanism that dynamically fuses graph, text, and tabular embeddings. In Step 3, the model is pretrained via teacher-to-student knowledge distillation using a contrastive objective that aligns the fused student embedding with fixed, large-scale teacher model embeddings. Finally, the model can be applied to downstream tasks.

## 1\.Introduction

Safe and efficacious drugs must exhibit a specific set of molecular properties, including potency against a drug target, selectivity, favorable pharmacokinetics and pharmacodynamics, and low toxicity\([17](https://arxiv.org/html/2606.11382#bib.bib1)\)\. Identifying molecules that satisfy these requirements is a lengthy and costly undertaking, often involving many cycles of design, synthesis, and experimental evaluation\([51](https://arxiv.org/html/2606.11382#bib.bib2)\)\. To accelerate drug discovery, deep learning models are trained on chemical datasets to learn relationships between molecular structure and target properties, including biological activity and absorption, distribution, metabolism, excretion, and toxicity \(ADMET\) endpoints\([53](https://arxiv.org/html/2606.11382#bib.bib17),[47](https://arxiv.org/html/2606.11382#bib.bib3),[46](https://arxiv.org/html/2606.11382#bib.bib4)\)\. These models enable a more efficient prioritization of promising candidate compounds for downstream experimental evaluation\([45](https://arxiv.org/html/2606.11382#bib.bib16),[26](https://arxiv.org/html/2606.11382#bib.bib18),[33](https://arxiv.org/html/2606.11382#bib.bib51)\)\.

Achieving this requires information\-rich molecular representations and algorithms capable of mapping these representations to their corresponding properties\. One promising approach is the use of chemical foundation models, which are first pretrained on large datasets to learn general chemical representations and then refined for specific downstream tasks using minimal additional data\([12](https://arxiv.org/html/2606.11382#bib.bib32),[7](https://arxiv.org/html/2606.11382#bib.bib27),[43](https://arxiv.org/html/2606.11382#bib.bib30)\)\. To assess the predictive performance of these models, standardized benchmark datasets with experimentally measured properties are essential\. Several public datasets, including Therapeutics Data Commons \(TDC\) and MoleculeNet, now serve as standard evaluation resources\([52](https://arxiv.org/html/2606.11382#bib.bib19),[18](https://arxiv.org/html/2606.11382#bib.bib15)\)\.

Many deep learning models achieve strong performance in molecular property prediction, but lack a more comprehensive chemical representation, struggle to generalize to different downstream tasks, or are very resource\-intensive\([42](https://arxiv.org/html/2606.11382#bib.bib35),[43](https://arxiv.org/html/2606.11382#bib.bib30),[56](https://arxiv.org/html/2606.11382#bib.bib36)\)\. This observation motivates the development of a lightweight model that leverages multiple molecular modalities for enhanced feature representation while supporting rapid deployment without compromising accuracy\([56](https://arxiv.org/html/2606.11382#bib.bib36),[54](https://arxiv.org/html/2606.11382#bib.bib7),[24](https://arxiv.org/html/2606.11382#bib.bib25)\)\.

In this work, our contributions are as follows:

1. We propose Graph-Language Alignment for Chemical Inference and Exploration using Representations (GLACIER), a multimodal foundation model that learns unified molecular representations by distilling knowledge from state-of-the-art teacher models through contrastive pretraining on just 100,000 drug-like molecules.
2. We introduce a novel Finsler ([5](https://arxiv.org/html/2606.11382#bib.bib12), [8](https://arxiv.org/html/2606.11382#bib.bib13)) geometry-aware fusion mechanism for multimodal molecular representation learning, using a shared Randers space to dynamically align graph, SMILES ([50](https://arxiv.org/html/2606.11382#bib.bib60)), and physicochemical descriptor embeddings and integrate complementary chemical information.
3. We demonstrate that compact multimodal foundation models can rival and surpass substantially larger models, achieving state-of-the-art performance across molecular property prediction benchmarks while remaining lightweight and fast at inference. Our code and tutorials are publicly available at [https://github.com/eemokey/glacier](https://github.com/eemokey/glacier).

## 2\.Related work

### 2\.1\.Molecular representation learning

Existing molecular representation learning approaches can be broadly classified into three categories\([9](https://arxiv.org/html/2606.11382#bib.bib10),[37](https://arxiv.org/html/2606.11382#bib.bib59)\): \(1\) Graph neural network\-based approaches: Methods such as GraphMVP\([31](https://arxiv.org/html/2606.11382#bib.bib23)\)and GraphFP\([32](https://arxiv.org/html/2606.11382#bib.bib24)\)leverage contrastive learning frameworks, while MiniMol\([24](https://arxiv.org/html/2606.11382#bib.bib25)\)and Chemeleon\([4](https://arxiv.org/html/2606.11382#bib.bib26)\)provide structural insight, but are memory\-intensive\. \(2\) Transformer\-based approaches: Models such as ChemBERTa\([7](https://arxiv.org/html/2606.11382#bib.bib27),[44](https://arxiv.org/html/2606.11382#bib.bib29)\), MolFormer\([43](https://arxiv.org/html/2606.11382#bib.bib30)\), ChemGPT\([13](https://arxiv.org/html/2606.11382#bib.bib31)\), ChemFM\([12](https://arxiv.org/html/2606.11382#bib.bib32)\), MolBERT\([28](https://arxiv.org/html/2606.11382#bib.bib33)\), and SimSon\([27](https://arxiv.org/html/2606.11382#bib.bib34)\)improve the learning of global molecular representations with the self\-attention mechanism, but they suffer from quadratic complexity\([49](https://arxiv.org/html/2606.11382#bib.bib44)\)\. \(3\) Hybrid\-based approaches: Models that combine both graph\-based and transformer\-based approaches include GROVER\([42](https://arxiv.org/html/2606.11382#bib.bib35)\), Uni\-Mol\([56](https://arxiv.org/html/2606.11382#bib.bib36),[19](https://arxiv.org/html/2606.11382#bib.bib37)\), and RMAT\([34](https://arxiv.org/html/2606.11382#bib.bib38)\)\. However, these similarly suffer from high computational complexity that leads to longer training and inference times\([23](https://arxiv.org/html/2606.11382#bib.bib9)\)\. 
To tackle scalability challenges, knowledge distillation has emerged as a promising strategy, in which knowledge is transferred from large or ensemble teacher models to lightweight students\([10](https://arxiv.org/html/2606.11382#bib.bib8)\)\. Despite the efficiency benefits of this paradigm, most molecular distillation methods are unimodal, and therefore overlook complementary insights present in different molecular representations\. GLACIER distills the knowledge from large\-scale chemical foundation models into a single lightweight model that integrates multimodal representations to overcome the challenges present in existing molecular property prediction approaches\.

### 2\.2\.Multimodal learning

Multimodal learning encompasses approaches that align or fuse data types for robust inference\. The fusion of modalities such as molecular graphs, SMILES strings, and physicochemical descriptors remains challenging\([9](https://arxiv.org/html/2606.11382#bib.bib10)\)\. Existing fusion methods include simple concatenation, cross\-attention, and contrastive learning that align data into shared spaces\([39](https://arxiv.org/html/2606.11382#bib.bib11)\)\. Recent multimodal works include CL\-FMAP\([55](https://arxiv.org/html/2606.11382#bib.bib39)\)\(molecular graph, SMILES strings, Morgan fingerprints\) and COATI\([22](https://arxiv.org/html/2606.11382#bib.bib40)\)\(3D molecular conformers, SMILES\), which leverage contrastive alignment across heterogeneous molecular representations to substantially improve model performance\. Additional multimodal works include GIT\-Mol\([30](https://arxiv.org/html/2606.11382#bib.bib41)\)\(molecular graph, SMILES strings, images\) and FineMolTex\([29](https://arxiv.org/html/2606.11382#bib.bib42)\)\(molecular graphs, textual descriptions\) that merge modalities via cross\-attention, further demonstrating the benefits of fusing structural and semantic molecular information\. Following the precedent set by these works, we propose a framework that leverages geometrically fused representations of molecular graphs, SMILES strings, and physicochemical descriptors as an effective interface for distilling complementary knowledge from diverse teacher architectures into a single efficient model\.

## 3\.The proposed approach

In this section, we provide a detailed description of GLACIER's multimodal student-teacher distillation framework, as illustrated in Figure [2](https://arxiv.org/html/2606.11382#S0.F2) (created in BioRender; Nguyen, E. (2026), [https://BioRender.com/lg9qxrf](https://biorender.com/lg9qxrf)). The overall pipeline is presented in Algorithms [1](https://arxiv.org/html/2606.11382#alg1) and [2](https://arxiv.org/html/2606.11382#alg2) in Appendix [D](https://arxiv.org/html/2606.11382#A4).

### 3\.1\.Step 1: Multimodal student architectures

GLACIER integrates the information present in different modalities using encoders for each modality\. The implementation in this work combines: \(1\) a graph encoder to extract information within molecular graphs; \(2\) a text encoder to extract information within SMILES strings, and \(3\) a tabular encoder to extract information from physicochemical descriptors\.

#### 3\.1\.1\.Graph encoder

To capture topological information, we employ a Message Passing Neural Network (MPNN) ([14](https://arxiv.org/html/2606.11382#bib.bib43)). The molecule is represented as a directed graph $G=(V,E)$, where messages are passed iteratively between bonds and capture the local chemical environment. We perform $K=3$ message passing steps. To construct molecular embeddings $\mathbf{h}_{graph}\in\mathbb{R}^{300}$, we employ an attentive aggregation mechanism: a readout function that uses a learned weighted average to combine atom representations, enabling the model to dynamically prioritize chemically relevant substructures within a molecular graph.

$$\mathbf{h}_{graph}=\text{Readout}(\text{MPNN}(G)) \tag{1}$$
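The attentive readout described above can be sketched as a softmax-weighted average of atom vectors. This is a minimal illustration, not the authors' implementation: the scoring vector `w`, the function name `attentive_readout`, and the toy two-dimensional atom states are assumptions made for the example (the paper's graph embedding is 300-dimensional, and its scoring function is not specified here).

```python
import math

def attentive_readout(atom_states, w):
    """Softmax-weighted average of atom vectors.

    atom_states: list of per-atom vectors (e.g. the output of message passing).
    w: hypothetical learned scoring vector, one weight per feature.
    """
    # One scalar relevance score per atom.
    scores = [sum(wi * hi for wi, hi in zip(w, h)) for h in atom_states]
    # Numerically stable softmax over atoms.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted average for each feature dimension.
    dim = len(atom_states[0])
    pooled = [sum(a * h[j] for a, h in zip(weights, atom_states)) for j in range(dim)]
    return pooled, weights

atoms = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# With a zero scoring vector every atom gets the same weight (plain mean).
h_uniform, attn_uniform = attentive_readout(atoms, w=[0.0, 0.0])
# A scoring vector that favours the first feature up-weights atoms 0 and 2.
h_focused, attn_focused = attentive_readout(atoms, w=[10.0, 0.0])
```

With a zero scoring vector the readout reduces to mean pooling; a non-zero vector shifts weight toward the atoms it scores highly, which is the sense in which the readout can prioritize substructures.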

#### 3\.1\.2\.Text encoder

To capture sequential chemical patterns, the text encoder uses lightweight Transformer layers, consisting of $N=2$ layers with a hidden dimension of $d_{text}=128$ and eight attention heads. First, we process SMILES strings using a custom Byte-Pair Encoding (BPE) tokenizer trained on 100,000 randomly sampled molecules from the Enamine REAL database (65 billion, version 2024.07) ([11](https://arxiv.org/html/2606.11382#bib.bib20)). We optimize the vocabulary to a compact size of $V=8000$, prioritizing the learning of chemically semantic substructures over rare character combinations. The tokenizer maps a SMILES string $S$ to a fixed-length sequence of token indices $\mathbf{w}\in\mathbb{R}^{L}$, defined formally as:

$$\mathbf{w}=\text{BPE}(S),\quad w_{i}\in\{0,\dots,V-1\} \tag{2}$$

where the sequence is padded to $L=512$ and includes special delimiters to define the molecular boundary of the attention mechanism. Then, we initialize the encoder input by summing learnable token embeddings with fixed sinusoidal positional encodings ($PE$) to retain sequence order information. The sequence is processed by the Transformer layers, and the output of the last hidden layer is pooled:

$$\mathbf{h}_{text}=\text{Pool}(\text{Transformer}(\mathbf{w}+PE)) \tag{3}$$
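The input preparation around Eqs. 2 and 3 can be illustrated with a short sketch. The sinusoidal encoding below follows the standard Transformer formulation. The padding id of `0` and the masked mean pooling are assumptions, since the text names neither the padding token nor the pooling operator. All function names are hypothetical.

```python
import math

def sinusoidal_pe(length, d_model):
    """Fixed sinusoidal positional encodings (standard Transformer form)."""
    pe = [[0.0] * d_model for _ in range(length)]
    for pos in range(length):
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)        # even feature index
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)  # odd feature index
    return pe

def pad_ids(ids, length, pad_id=0):
    """Truncate or right-pad token ids to a fixed length; also return a mask."""
    ids = ids[:length]
    n_pad = length - len(ids)
    mask = [1] * len(ids) + [0] * n_pad
    return ids + [pad_id] * n_pad, mask

def masked_mean_pool(hidden, mask):
    """Average hidden vectors over non-padding positions only (assumed pooling)."""
    n = sum(mask)
    dim = len(hidden[0])
    return [sum(h[j] for h, m in zip(hidden, mask) if m) / n for j in range(dim)]

pe = sinusoidal_pe(length=512, d_model=128)        # L = 512, d_text = 128
ids, mask = pad_ids([5, 17, 42], length=512)       # toy token ids
pooled = masked_mean_pool([[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]], [1, 1, 0])
```

Masking matters here because most of the 512 positions are padding for a typical molecule; averaging over them would dilute the embedding.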

#### 3\.1\.3\.Tabular encoder

Complementing the structural and sequential representations, we incorporate global physicochemical descriptors with a tabular encoder. The input consists of a feature vector $\mathbf{x}_{tab}\in\mathbb{R}^{217}$ computed by RDKit ([41](https://arxiv.org/html/2606.11382#bib.bib55)). These descriptors include molecular properties such as molecular weight, logP, and the number of hydrogen bond donors and acceptors, as described in Table [9](https://arxiv.org/html/2606.11382#A1.T9) in Appendix [A.4](https://arxiv.org/html/2606.11382#A1.SS4). The encoder is structured as an MLP, which yields the descriptor embedding:

$$\mathbf{h}_{tab}=\text{MLP}(\mathbf{x}_{tab}) \tag{4}$$

### 3\.2\.Step 2: Geometry\-aware modality fusion

After each modality is processed by its encoder, we pass the result through a dedicated projection head, implemented as a three-layer MLP, to map the representations into a shared latent space. We denote these projected embeddings as $\mathbf{z}_{graph}$, $\mathbf{z}_{text}$, and $\mathbf{z}_{tab}$ for molecular graph, text, and tabular embeddings, respectively.

Using these modality embeddings, we propose a novel gated cross-attention fusion mechanism modeled on Finsler geometry for molecular representation learning, specifically adapting the asymmetric Randers metric ([40](https://arxiv.org/html/2606.11382#bib.bib14), [8](https://arxiv.org/html/2606.11382#bib.bib13)). Unlike Riemannian metrics, which measure distance isotropically, a Randers metric incorporates a directional drift vector field, effectively reducing the cost of transport in directions aligned with the drift. We adapt this to the semantic space by defining a drift vector $\boldsymbol{\omega}$ derived from the text embedding $\mathbf{z}_{text}$, where $\mathbf{v}=\text{MLP}_{drift}(\mathbf{z}_{text})$:

$$\boldsymbol{\omega}=\frac{\mathbf{v}}{\|\mathbf{v}\|_{2}+\epsilon}\cdot\tanh(\|\mathbf{v}\|_{2}) \tag{5}$$

This creates a geometric bias where graph and tabular embeddings that align with the text's semantic direction are considered closer and thus more relevant.
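One property of Eq. 5 is worth making explicit: the rescaling keeps the direction of $\mathbf{v}$ but squashes its length to $\tanh(\|\mathbf{v}\|_{2})$, so $\|\boldsymbol{\omega}\|_{2}<1$ for any input. By the Cauchy-Schwarz inequality, a sum of a Euclidean norm and a projection onto such a vector cannot become negative. The sketch below checks this numerically. The value of `eps` is an assumption (the text does not state it), and `drift_vector` is a hypothetical name.

```python
import math

def drift_vector(v, eps=1e-8):
    """Eq. 5: keep the direction of v, rescale its norm to tanh(||v||) < 1.

    eps is an assumed small constant guarding against division by zero.
    """
    norm = math.sqrt(sum(x * x for x in v))
    scale = math.tanh(norm) / (norm + eps)
    return [x * scale for x in v]

omega = drift_vector([3.0, 4.0])                      # ||v|| = 5
omega_norm = math.sqrt(sum(x * x for x in omega))     # close to tanh(5), below 1
omega_small = drift_vector([0.03, 0.04])              # ||v|| = 0.05
omega_zero = drift_vector([0.0, 0.0])                 # stays at the origin
```

For small inputs $\tanh(x)\approx x$, so the drift is nearly $\mathbf{v}$ itself; for large inputs the norm saturates just below 1.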

Let $\mathbf{z}_{text}$ serve as the query and the set of complementary embeddings $S=\{\mathbf{z}_{graph},\mathbf{z}_{tab}\}$ serve as the keys. The asymmetric Randers distance $d$ is defined as the combination of the Euclidean distance and the projection onto the drift vector:

$$d(\mathbf{z}_{text},\mathbf{k})=\|\mathbf{k}-\mathbf{z}_{text}\|_{2}+\langle\mathbf{k}-\mathbf{z}_{text},\boldsymbol{\omega}\rangle \tag{6}$$

An attention correction vector $\mathbf{c}$ is computed via softmax over these negative distances. To balance the integration of this correction, we adopt a text-contextualized approach that dynamically adjusts the importance of the modalities. GLACIER learns a scalar amplitude $\alpha$, which modulates a sigmoid gate ([38](https://arxiv.org/html/2606.11382#bib.bib21)) based on the minimum geometric distance:

$$\gamma=\alpha(\mathbf{z}_{text})\cdot\sigma\left(-\min_{\mathbf{k}\in S}d(\mathbf{z}_{text},\mathbf{k})\cdot\lambda\right) \tag{7}$$

Here, the learnable parameters serve three geometric roles: the weights of $\text{MLP}_{drift}$ learn the optimal semantic direction for fusion; $\text{MLP}_{amp}$ learns the confidence magnitude $\alpha$, allowing the model to determine how much additional information to accept; and the scalar $\lambda$ (gate sensitivity) learns the curvature of the gating function, controlling how strictly geometric misalignment is penalized. The text embedding is refined as $\mathbf{\hat{z}}_{text}=\mathbf{z}_{text}+\gamma\mathbf{c}$, and the final fused embedding $\mathbf{h}_{fused}$ is obtained by concatenating the refined text embeddings with the molecular graph and tabular embeddings.
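The fusion step can be traced end to end with a small sketch of Eqs. 6 and 7 as written. Several details are assumptions rather than the authors' implementation: the correction $\mathbf{c}$ is taken to be the attention-weighted sum of the two keys, `alpha` and `lam` are passed as fixed numbers standing in for the learned $\alpha(\mathbf{z}_{text})$ and $\lambda$, and the function names are hypothetical.

```python
import math

def randers_distance(z_text, k, omega):
    """Eq. 6: Euclidean distance plus projection of (k - z_text) onto omega."""
    diff = [ki - zi for ki, zi in zip(k, z_text)]
    euclid = math.sqrt(sum(x * x for x in diff))
    drift = sum(x * w for x, w in zip(diff, omega))
    return euclid + drift

def finsler_fuse(z_text, z_graph, z_tab, omega, alpha, lam):
    """Gated fusion following Eqs. 6-7; alpha and lam stand in for learned values."""
    keys = [z_graph, z_tab]
    dists = [randers_distance(z_text, k, omega) for k in keys]
    # Softmax over negative distances: closer keys receive larger weights.
    m = max(-d for d in dists)
    exps = [math.exp(-d - m) for d in dists]
    total = sum(exps)
    attn = [e / total for e in exps]
    # Assumption: the correction c is the attention-weighted sum of the keys.
    c = [sum(a * k[j] for a, k in zip(attn, keys)) for j in range(len(z_text))]
    # Eq. 7: sigma(-x) = 1 / (1 + exp(x)) with x = min distance * lam.
    gamma = alpha / (1.0 + math.exp(min(dists) * lam))
    z_hat = [z + gamma * ci for z, ci in zip(z_text, c)]
    # Final embedding: refined text, graph, and tabular parts concatenated.
    return z_hat + list(z_graph) + list(z_tab), attn, gamma

omega = [0.5, 0.0]
d_forward = randers_distance([0.0, 0.0], [1.0, 0.0], omega)   # 1 + 0.5
d_backward = randers_distance([1.0, 0.0], [0.0, 0.0], omega)  # 1 - 0.5
fused, attn, gamma = finsler_fuse(
    z_text=[0.0, 0.0], z_graph=[1.0, 0.0], z_tab=[0.0, 2.0],
    omega=omega, alpha=1.0, lam=1.0,
)
```

The two calls to `randers_distance` swap query and key and return different values, which is the asymmetry that distinguishes this distance from a Euclidean one. The gate `gamma` shrinks toward zero as the nearest key moves farther away, so poorly aligned modalities contribute little to the refined text embedding.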

### 3\.3\.Step 3: Student\-teacher knowledge distillation

To distill knowledge from large\-scale models into our lightweight architecture, we align the fused student embeddings with one or multiple fixed teacher embeddings\. We investigate distillation from two high\-performing teachers, each representing a different model family: \(1\) a graph\-based teacher, MiniMol and \(2\) a transformer\-based teacher, MolFormer\.

#### 3\.3\.1\.Projection distillation layers

We utilize diverse sets of $K$ teacher models, each providing precomputed, fixed embeddings $\mathbf{t}_{k}$ of varying dimensionality and architectural origin. To align the student with the teacher, we employ independent teacher projections $\{P_{k}\}_{k=1}^{K}$, each consisting of a two-layer MLP to project the embeddings of the teacher into the shared dimension $d_{shared}=512$. Simultaneously, the fused student embedding $\mathbf{h}_{fused}$ has its own projector layer ($P_{S}$). This standard module decouples the geometric fusion space from the direct gradients of the alignment loss. The final embeddings for alignment are the following:

$$\mathbf{z}_{S}=P_{S}(\mathbf{h}_{fused}),\quad\mathbf{z}_{T}^{(k)}=P_{k}(\text{stop\_grad}(\mathbf{t}_{k})) \tag{8}$$

#### 3\.3\.2\.Distillation objective

Standard multi-teacher distillation often treats all teachers equally, which is suboptimal when teachers have varying expertise. To address this, we introduce a dynamic multi-teacher InfoNCE loss that allows the student to dynamically adjust the contribution of each teacher ([36](https://arxiv.org/html/2606.11382#bib.bib52)). We employ an internal contribution head, $T(\cdot)$, a two-layer MLP that predicts a contribution score $\tau_{k}\in[\epsilon,1.0]$ for each teacher based on the current embedding of the student $\mathbf{z}_{S}$. To prevent the model from completely ignoring difficult teachers, we enforce a minimum contribution floor $\epsilon=0.1$:

$$\tau_{k}=\sigma(\text{MLP}_{contribution}(\mathbf{z}_{S}))\cdot(1-\epsilon)+\epsilon \tag{9}$$

The total loss is calculated as the weighted sum of the InfoNCE loss $\mathcal{L}_{NCE}$ for each teacher, regularized by a logarithmic term to prevent collapse:

$$\mathcal{L}=\sum_{k=1}^{K}\left(\tau_{k}\cdot\mathcal{L}_{NCE}(\mathbf{z}_{S},\mathbf{z}_{T}^{(k)})-\log(\tau_{k})\right) \tag{10}$$

Thus, GLACIER can jointly learn and distill knowledge from multiple teachers.
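Eqs. 9 and 10 can be exercised on toy data with the sketch below. It is an illustration under stated assumptions, not the authors' code: the InfoNCE term uses cosine similarity with a temperature of `0.1` (neither is specified in this section), each teacher gets a single scalar `logit` standing in for the output of $\text{MLP}_{contribution}$, and all function names are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def info_nce(students, teachers, temperature=0.1):
    """Mean InfoNCE loss; row i of each list is a positive pair, other rows are negatives."""
    loss = 0.0
    for i, s in enumerate(students):
        logits = [cosine(s, t) / temperature for t in teachers]
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_denom - logits[i]
    return loss / len(students)

def contribution(logit, eps=0.1):
    """Eq. 9: squash a raw score into [eps, 1]."""
    return (1.0 - eps) / (1.0 + math.exp(-logit)) + eps

def multi_teacher_loss(students, teacher_sets, logits, eps=0.1):
    """Eq. 10: sum over teachers of tau_k * L_NCE - log(tau_k)."""
    total = 0.0
    for teachers, logit in zip(teacher_sets, logits):
        tau = contribution(logit, eps)
        total += tau * info_nce(students, teachers) - math.log(tau)
    return total

students = [[1.0, 0.0], [0.0, 1.0]]
teacher_a = [[1.0, 0.0], [0.0, 1.0]]   # already aligned with the student
teacher_b = [[0.0, 1.0], [1.0, 0.0]]   # positives and negatives swapped
loss = multi_teacher_loss(students, [teacher_a, teacher_b], logits=[0.0, 0.0])
```

A short calculation shows why the logarithmic term discourages collapse. If $\tau_{k}$ were a free variable and $\mathcal{L}_{NCE}$ were held fixed, setting the derivative of $\tau_{k}\mathcal{L}_{NCE}-\log\tau_{k}$ to zero gives $\tau_{k}=1/\mathcal{L}_{NCE}$, clipped to $[\epsilon,1]$. Teachers that are harder to match receive smaller weights, but driving a weight toward the floor is penalized by $-\log\tau_{k}$. In the paper $\tau_{k}$ is predicted from $\mathbf{z}_{S}$, so this is only a heuristic reading.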

## 4\.Experiments

Table 1. AUROC scores for molecular property prediction on TDC and MoleculeNet. The best results are marked in bold, and the second-best results are underlined. $\uparrow$: the higher the better. Values represent means and their standard deviations from three independent runs.

### 4\.1\.Pretraining GLACIER

To construct the pretraining corpus, we randomly sampled 100,000 molecules from the Enamine REAL database (65 billion molecules, version 2024.07) ([11](https://arxiv.org/html/2606.11382#bib.bib20)), chosen for its extensive collection of synthetically accessible, drug-like compounds ([16](https://arxiv.org/html/2606.11382#bib.bib22)). An assessment of potential overlap between the pretraining corpus and downstream benchmarks is provided in Figure [6](https://arxiv.org/html/2606.11382#A1.F6) in Appendix [A.3](https://arxiv.org/html/2606.11382#A1.SS3). The ChemAxon Extended SMILES (CXSMILES) annotations ([6](https://arxiv.org/html/2606.11382#bib.bib56)) were removed, retaining only the canonical SMILES strings. These standardized molecules were then used to generate the three pretraining modalities employed by GLACIER: molecular graphs, SMILES strings, and physicochemical descriptors.

To improve representation learning during pretraining, we employed dynamic SMILES augmentation by generating a randomized valid SMILES string for each molecule at every epoch\. This approach exploits the fact that a single molecular graph can be represented by multiple equivalent SMILES strings depending on the choice of starting atom and graph traversal order\. By exposing the model to diverse textual realizations of the same underlying structure, this stochasticity reduces the reliance on specific syntactic patterns and encourages the learning of chemically invariant representations\([3](https://arxiv.org/html/2606.11382#bib.bib58)\)\. As these alternative SMILES representations are generated on\-the\-fly, they increase representation diversity without requiring additional molecular data or substantial computational overhead\.

For knowledge distillation, we used MiniMol and MolFormer as teacher models\. The teacher embeddings were extracted once and reused throughout pretraining, making the distillation process computationally efficient\. GLACIER was pretrained for 250 epochs in 5\.67 hours on a single NVIDIA RTX 4080 GPU, highlighting the modest computational requirements of the framework\. Additional implementation and hardware details are provided in Tables[7](https://arxiv.org/html/2606.11382#A1.T7)and[8](https://arxiv.org/html/2606.11382#A1.T8)in Appendix[A](https://arxiv.org/html/2606.11382#A1)\.

### 4\.2\.Molecular benchmark datasets

We evaluated GLACIER's performance on 11 molecular property prediction tasks taken from two main benchmarks relevant for drug discovery: TDC ([18](https://arxiv.org/html/2606.11382#bib.bib15)) and MoleculeNet ([52](https://arxiv.org/html/2606.11382#bib.bib19)). These datasets span two broad property prediction scenarios: (1) Molecular classification datasets: AMES, BBB, Pgp, E-Sub, E-Inh, hERG, PAMPA, Tox21, and ToxCast; (2) Molecular regression datasets: ESOL and LIPO. These datasets vary in both the number of classes, from 2 to 617 classes, and in the total number of samples, from 664 to 13,192 molecules. This allows us to verify our distillation method for a broad range of configurations and ensure its applicability. A numerical overview of the datasets and descriptions of their corresponding tasks are provided in Tables [10](https://arxiv.org/html/2606.11382#A2.T10) and [11](https://arxiv.org/html/2606.11382#A2.T11) in Appendix [B.5](https://arxiv.org/html/2606.11382#A2.SS5).

### 4\.3\.Baselines

We compared GLACIER against a range of recent baselines spanning diverse methodologies, including graph neural network–based models \(MiniMol\([24](https://arxiv.org/html/2606.11382#bib.bib25)\)and Chemeleon\([4](https://arxiv.org/html/2606.11382#bib.bib26)\)\) and text\-based transformer models \(ChemBERTa\([44](https://arxiv.org/html/2606.11382#bib.bib29)\), MolFormer\([43](https://arxiv.org/html/2606.11382#bib.bib30)\), ChemGPT\([13](https://arxiv.org/html/2606.11382#bib.bib31)\), and ChemFM\-1B\([12](https://arxiv.org/html/2606.11382#bib.bib32)\)\)\. We also evaluated hybrid models, including RMAT\([34](https://arxiv.org/html/2606.11382#bib.bib38)\), COATI\([22](https://arxiv.org/html/2606.11382#bib.bib40)\), CL\-FMAP\([55](https://arxiv.org/html/2606.11382#bib.bib39)\), and GIT\-Mol\([30](https://arxiv.org/html/2606.11382#bib.bib41)\)\.

We organize our experiments around the following research questions \(RQs\):

- •RQ1: Does GLACIER perform well on downstream tasks?
- •RQ2: Does GLACIER outperform its baseline teachers?
- •RQ3: Does GLACIER produce interpretable embeddings?
- •RQ4: Does GLACIER have an optimal fusion mechanism, modality composition, and pretraining scale?

![Plot of student-teacher performance comparisons where the student outperforms the teacher on most classification and regression tasks.](https://arxiv.org/html/2606.11382v1/fig3.png)

Figure 3. Performance comparison across molecular property prediction tasks. Muted colors represent teacher baselines, while saturated colors represent their respective student version in a GLACIER-Finsler distillation framework (MolFormer in purple, MiniMol in blue, Mi-Mo in orange). Nine datasets are used for classification tasks; the remaining two are regression tasks. Error bars correspond to the standard deviation of the mean across three independent runs.

Table 2. RMSE scores for molecular property prediction on MoleculeNet. The best results are marked in bold, and the second-best results are underlined. $\downarrow$: the lower the better. Values represent means and their standard deviations from three independent runs.

### 4\.4\.RQ1: Downstream property predictions

We evaluated GLACIER models built using three different teacher configurations: MolFormer as a single teacher, MiniMol as a single teacher, and the combination of MiniMol and MolFormer as teachers \(Mi\-Mo\)\. A detailed explanation on the choice of teachers is provided in Appendix[C\.1](https://arxiv.org/html/2606.11382#A3.SS1)\. For downstream evaluation, we conducted downstream fingerprinting, which is more computationally efficient and practical compared to end\-to\-end finetuning\([20](https://arxiv.org/html/2606.11382#bib.bib45),[24](https://arxiv.org/html/2606.11382#bib.bib25),[37](https://arxiv.org/html/2606.11382#bib.bib59)\)\. Specifically, we extracted frozen embeddings of the final layer of GLACIER for molecules in a given downstream task\. These embeddings were used to train a small task head \(logistic regression\) to make task\-specific predictions\. Following the benchmarks of TDC\([18](https://arxiv.org/html/2606.11382#bib.bib15)\)and MoleculeNet\([52](https://arxiv.org/html/2606.11382#bib.bib19)\), we used AUROC \(Area Under Receiver Operating Characteristic Curve\) as an evaluation metric for classification tasks and RMSE \(Root Mean Squared Error\) for regression tasks\. The molecules in each benchmark dataset underwent a standardization process using RDKit\([41](https://arxiv.org/html/2606.11382#bib.bib55)\)\. This includes the removal of salts, neutralization of charges, canonicalization of SMILES strings, and the removal of duplicates\. We then used an 80/10/10 scaffold split for training, validation, and testing to evaluate generalization to unseen chemical scaffolds\([52](https://arxiv.org/html/2606.11382#bib.bib19)\)\. More details on task evaluations are provided in Appendix[B](https://arxiv.org/html/2606.11382#A2)\.
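The 80/10/10 scaffold split can be sketched as a grouping problem: molecules sharing a scaffold must land in the same subset so that the test set contains only unseen scaffolds. The paper does not spell out its splitting procedure, so the greedy largest-group-first assignment below is an assumption modeled on a commonly used deterministic approach. Scaffold identifiers are taken as precomputed strings (for example Bemis-Murcko scaffold SMILES produced by a cheminformatics toolkit), and `scaffold_split` is a hypothetical name.

```python
from collections import defaultdict

def scaffold_split(scaffolds, frac_train=0.8, frac_valid=0.1):
    """Split molecule indices so that no scaffold spans two subsets.

    scaffolds: one scaffold identifier per molecule (assumed precomputed).
    """
    groups = defaultdict(list)
    for idx, scaffold in enumerate(scaffolds):
        groups[scaffold].append(idx)
    # Largest scaffold groups first, so rare scaffolds end up in valid/test.
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))
    n = len(scaffolds)
    train, valid, test = [], [], []
    for group in ordered:
        if len(train) + len(group) <= frac_train * n:
            train.extend(group)
        elif len(valid) + len(group) <= frac_valid * n:
            valid.extend(group)
        else:
            test.extend(group)
    return train, valid, test

# Toy data: ten molecules over four scaffolds.
scaffolds = ["A"] * 6 + ["B"] * 2 + ["C", "D"]
train, valid, test = scaffold_split(scaffolds)
```

Because whole groups are assigned at once, the realized proportions only approximate 80/10/10 on real datasets. Frequent scaffolds fill the training set and rare ones are pushed to validation and test, which makes the evaluation harder than a random split.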

Although the results reported in Tables [1](https://arxiv.org/html/2606.11382#S4.T1) and [2](https://arxiv.org/html/2606.11382#S4.T2) indicate that molecular property prediction remains challenging, two observations arise from our analyses\. First, GLACIER on average outperforms other models on the classification and regression benchmarks\. This suggests that geometry\-aware fusion coupled with contrastive distillation contributes to a latent space that successfully captures relevant molecular features, leading to a model that generalizes well to various property prediction tasks\. Second, we observe that compact models can outperform substantially larger foundation models, indicating that parameter scaling alone does not guarantee gains in predictive performance\.

![Refer to caption](https://arxiv.org/html/2606.11382v1/fig4.png)

Figure 4\. Model performance \(RMSE\) versus model parameter count \(left\) and model inference time per molecule \(right\)\.

![Refer to caption](https://arxiv.org/html/2606.11382v1/fig5.png)

Figure 5\. \(Left\) Two\-dimensional density\-normalized scatter plot assessing the alignment between cosine similarity in the GLACIER embedding space and Tanimoto coefficients for corresponding molecules\. \(Right\) Two\-dimensional t\-SNE projection of 512\-dimensional GLACIER embeddings for molecules from the MoleculeNet ESOL dataset, illustrating the structure of the learned representation\.

Table 3\. Ablation study comparing Concatenation vs Finsler fusion using AUROC scores on TDC and MoleculeNet\. The best results are marked in bold, and the second\-best results are underlined\. ↑: higher is better\. Values represent means and their standard deviations from three independent runs\.

### 4\.5\.RQ2: Distillation efficacy

We evaluated the performance of GLACIER models in relation to their respective teacher models across 11 benchmark datasets\. In particular, we considered the graph\-based model MiniMol and the transformer\-based model MolFormer, comparing both single\-teacher and dual\-teacher distillation strategies\. The single\-teacher variants used either MiniMol or MolFormer alone, whereas the dual\-teacher variant leveraged both models simultaneously during pretraining\. Figure [3](https://arxiv.org/html/2606.11382#S4.F3) shows the performance of our student GLACIER models compared to their baseline teachers\. Two main observations emerge\. First, GLACIER consistently achieves comparable or superior performance, outperforming its respective teacher baselines across the majority of benchmarks\. Second, distillation from complementary teachers can further improve performance, with dual\-teacher GLACIER variants \(Mi\-Mo\) in some cases surpassing single\-teacher models, suggesting that integrating knowledge from multiple teachers can yield additional gains\.

Beyond predictive performance, practical deployment requires models to be computationally efficient\. We therefore compared model performance \(AUROC for classification tasks and RMSE for regression tasks\) against parameter count and inference latency\. Details on the experimental setup are provided in Appendix [C\.2](https://arxiv.org/html/2606.11382#A3.SS2)\. We visualize model performance compared to parameter count and latency in Figures [1](https://arxiv.org/html/2606.11382#S0.F1) and [4](https://arxiv.org/html/2606.11382#S4.F4)\. Notably, we show that GLACIER achieves high AUROC and low RMSE with a significantly smaller parameter count, outperforming large baseline models such as ChemFM \(1 billion parameters\)\. Moreover, GLACIER demonstrates superior performance while maintaining lower inference latency than other models\. These insights can be leveraged for training a smaller, faster GLACIER model from a strong teacher model such as MiniMol or MolFormer\.
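
To make the two efficiency axes concrete, the sketch below shows one plausible way to measure per\-molecule inference latency and to count parameters\. It is only an assumption about how such numbers could be obtained; the actual setup is the one described in Appendix C\.2\. The names `latency_per_molecule`, `count_parameters`, and `toy_embed` are hypothetical, and the toy embedding function merely stands in for a real model\.

```python
import math
import statistics
import time


def latency_per_molecule(embed_fn, molecules, repeats=5):
    """Median wall-clock seconds per molecule for a batch embedding call.

    `embed_fn` is any callable that takes a list of molecules; the median
    over several repeats dampens the effect of one-off slow runs.
    """
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        embed_fn(molecules)
        timings.append(time.perf_counter() - start)
    return statistics.median(timings) / len(molecules)


def count_parameters(weight_shapes):
    """Total number of parameters given the shape of every weight tensor."""
    return sum(math.prod(shape) for shape in weight_shapes)


def toy_embed(batch):
    """Stand-in for a model: maps each SMILES string to a 4-dim vector."""
    return [[float(len(smiles))] * 4 for smiles in batch]


smiles_batch = ["CCO", "c1ccccc1", "CC(=O)O"]
print(latency_per_molecule(toy_embed, smiles_batch))
# One 512x256 weight matrix plus a 256-dim bias.
print(count_parameters([(512, 256), (256,)]))  # prints: 131328
```

In practice, latency also depends on hardware, batch size, and preprocessing cost, so comparisons are only meaningful when those are held fixed across models\.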

### 4\.6\.RQ3: Embedding interpretability

To evaluate whether GLACIER learns interpretable molecular representations, we randomly sampled 1,000 molecules from the Enamine REAL database that were not included in the pretraining set\. These molecules were embedded using a GLACIER model pretrained with a single MiniMol teacher\. For each molecular pair, we compared structural similarity, measured by the Tanimoto coefficient between Morgan2 fingerprints \([2](https://arxiv.org/html/2606.11382#bib.bib47)\), with representation similarity, measured as the cosine similarity between their GLACIER embeddings\. The resulting moderate Pearson correlation \(r = 0\.48; Figure [5](https://arxiv.org/html/2606.11382#S4.F5)\) indicates that GLACIER preserves topological similarity in the learned representation space\. In particular, structurally similar molecules tend to be embedded closer together, while the representations still capture information beyond that encoded by conventional molecular fingerprints\.
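
The pairwise comparison described above can be sketched in a few lines of Python\. This is a minimal illustration under stated assumptions, not the authors' analysis code: fingerprints are represented as sets of on\-bit indices, embeddings as plain lists of floats, and all function names and toy values are hypothetical\.

```python
import math
from itertools import combinations


def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bits."""
    union = len(bits_a | bits_b)
    return len(bits_a & bits_b) / union if union else 0.0


def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def similarity_correlation(fingerprints, embeddings):
    """Correlate structural and embedding similarity over all molecule pairs."""
    pairs = list(combinations(range(len(fingerprints)), 2))
    structural = [tanimoto(fingerprints[i], fingerprints[j]) for i, j in pairs]
    learned = [cosine(embeddings[i], embeddings[j]) for i, j in pairs]
    return pearson(structural, learned)


# Toy data: molecules 0 and 1 share fingerprint bits and have close embeddings.
toy_fps = [{1, 2, 3}, {1, 2, 4}, {7, 8, 9}]
toy_embs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
print(similarity_correlation(toy_fps, toy_embs))
```

With 1,000 molecules this loop covers 499,500 unordered pairs, which is still tractable in plain Python\.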

Beyond assessing the latent\-space organization of chemical structures, we further examined whether molecules with similar properties are mapped to nearby regions in the representation space \([48](https://arxiv.org/html/2606.11382#bib.bib48)\)\. To this end, we projected the 512\-dimensional GLACIER embeddings of molecules from the MoleculeNet ESOL dataset into a two\-dimensional t\-SNE space for visualization \(Figure [5](https://arxiv.org/html/2606.11382#S4.F5), right\)\. The resulting projection reveals clear clusters of chemically related compounds\.

Together, these findings suggest that GLACIER learns structured, property\-aware representations that are well suited for transfer to diverse downstream molecular prediction tasks\.

Table 4\. Ablation study comparing Concatenation vs Finsler fusion using RMSE scores on MoleculeNet\. The best results are marked in bold, and the second\-best results are underlined\. ↓: lower is better\. Values represent means and their standard deviations from three independent runs\.

Table 5\. Average performance across all classification and regression tasks in the modality ablation study using GLACIER with MiniMol as a teacher\. The first three columns indicate which modalities are used\. Best results are marked in bold\. ↑: higher is better; ↓: lower is better\.

| Graph | Text | Tabular | Avg AUROC ↑ | Avg RMSE ↓ |
|:---:|:---:|:---:|:---:|:---:|
| ✓ | ✓ | ✓ | **0\.799** | **0\.806** |
| ✓ | × | ✓ | 0\.793 | 0\.828 |
| ✓ | ✓ | × | 0\.792 | 0\.942 |
| × | ✓ | ✓ | 0\.777 | 0\.890 |
| ✓ | × | × | 0\.781 | 1\.011 |
| × | ✓ | × | 0\.769 | 1\.129 |
| × | × | ✓ | 0\.760 | 1\.023 |

### 4\.7\.RQ4: Ablation studies

We conducted a series of three ablation studies designed to isolate the contributions of individual model components\. First, we evaluated the proposed Finsler fusion mechanism against two widely used multimodal integration strategies, concatenation and cross\-attention, using both MolFormer and MiniMol as teacher models\. As shown in Tables [3](https://arxiv.org/html/2606.11382#S4.T3) and [4](https://arxiv.org/html/2606.11382#S4.T4), Finsler fusion provided a modest but consistent improvement over both baselines across classification and regression benchmarks\. When MolFormer is used as the teacher model, the three fusion strategies achieve comparable performance, suggesting that the benefits of more sophisticated fusion are limited in this setting\. In contrast, with MiniMol as the teacher, Finsler fusion substantially outperforms standard cross\-attention, yielding better average performance on both classification \(AUROC: 0\.799 vs\. 0\.783\) and regression \(RMSE: 0\.806 vs\. 1\.055\) tasks\. These results indicate that the effectiveness of multimodal fusion is influenced by the choice of teacher model, with Finsler fusion providing the greatest benefit when paired with stronger teachers and demonstrating its potential to further enhance distilled molecular representations\.

Second, to assess the contribution of each modality, we compared the full trimodal GLACIER model against both pairwise bimodal \(graph\+text, graph\+tabular, and text\+tabular\) and unimodal \(graph, text, and tabular\) variants\. As shown in Table [5](https://arxiv.org/html/2606.11382#S4.T5), the full model with MiniMol as the teacher consistently outperforms all reduced\-modality configurations, highlighting the complementary nature of the three modalities and their joint contribution to more robust and generalizable representations\. Detailed performance tables are provided in Tables [14](https://arxiv.org/html/2606.11382#A3.T14) and [15](https://arxiv.org/html/2606.11382#A3.T15) in Appendix [C\.4](https://arxiv.org/html/2606.11382#A3.SS4)\.

Third, to examine the impact of pretraining scale, we evaluated GLACIER pretrained on datasets of varying sizes \(10,000, 50,000, 100,000, and 500,000 randomly sampled molecules\) using MiniMol as the teacher model\. As shown in Table [6](https://arxiv.org/html/2606.11382#S4.T6), performance improves rapidly with increasing data size, demonstrating the data efficiency of the distillation framework, which already achieves strong results with only 10,000 molecules\. Gains then plateau, with performance peaking around 100,000 molecules and remaining stable or slightly decreasing at larger scales\. This behavior is consistent with the compact capacity of the student model \(approximately 5% of the parameters of larger foundation models\) and the nature of the distillation objective, which can saturate once sufficient coverage of the teacher’s knowledge is achieved\. Similar scaling patterns have been reported in prior work \([21](https://arxiv.org/html/2606.11382#bib.bib49), [35](https://arxiv.org/html/2606.11382#bib.bib50)\)\. Additional experimental results are provided in Tables [12](https://arxiv.org/html/2606.11382#A3.T12) and [13](https://arxiv.org/html/2606.11382#A3.T13) in Appendix [C\.3](https://arxiv.org/html/2606.11382#A3.SS3) and Table [16](https://arxiv.org/html/2606.11382#A3.T16) in Appendix [C\.5](https://arxiv.org/html/2606.11382#A3.SS5)\.

Table 6\. Average performance across all classification and regression tasks at different pretraining dataset sizes using GLACIER with MiniMol as a teacher\. The best results are marked in bold, and the second\-best results are underlined\. ↑: higher is better; ↓: lower is better\.

## 5\.Conclusions

In this paper, we present GLACIER, a multimodal foundation model that distills complementary knowledge from large teacher models via contrastive learning\. GLACIER introduces a Finsler geometry\-aware fusion mechanism that bridges asymmetric modality gaps through learnable drift and dynamic gating, enabling effective integration of graph, text, and tabular modalities\.

Despite being pretrained on only 100,000 drug\-like compounds, GLACIER achieves strong and consistent performance across 11 molecular benchmark datasets while maintaining high inference efficiency, demonstrating that compact multimodal models can rival larger and more resource\-intensive approaches\.

More broadly, this work highlights the promise of multimodal distillation frameworks for scalable molecular learning and efficient discovery of compounds with desirable properties\. In large\-scale virtual screening settings involving billions of candidates, even modest improvements in predictive accuracy can substantially influence the ranking of top\-scoring molecules and downstream experimental prioritization\. By integrating complementary chemical information into a unified representation space, GLACIER supports a wide range of molecular discovery pipelines, including virtual screening and lead optimization\. To facilitate further research and adoption, we release our code and models at [https://github\.com/eemokey/glacier](https://github.com/eemokey/glacier)\.

## 6\.Limitations and ethical considerations

The results presented here suggest that GLACIER can efficiently distill knowledge from large teacher models into a compact multimodal representation while retaining strong predictive performance across diverse downstream tasks\. Nevertheless, three caveats are worth noting\.

First, GLACIER relies on the availability of strong teachers and therefore cannot be considered a fully standalone foundation model\. Although knowledge from multiple teachers can be distilled into a single student, our current implementation does not consistently improve upon the strongest teacher and may instead converge toward their average performance\. Future work may explore more effective strategies to combine complementary knowledge derived from multiple teachers\.

Second, unlike conventional Euclidean attention mechanisms, the proposed fusion module inherits the complexities of asymmetry in Finsler geometry: the parameters of the Finsler fusion module do not admit a closed\-form solution, and optimization may converge to local minima \([8](https://arxiv.org/html/2606.11382#bib.bib13)\)\.

Third, as with many models developed for molecular property prediction, there is potential for misuse\. Models trained on biological and toxicity\-related data could, in principle, be applied to the design of harmful compounds\. Responsible deployment and appropriate safeguards are therefore important considerations for future applications of this work\.

However, these limitations should not obscure the central finding of this study: a compact and nimble multimodal student model can achieve performance competitive with substantially larger foundation models\. The results suggest that knowledge distillation offers a promising path toward efficient and deployable molecular learning systems\.

###### Acknowledgements\.

E\.N\. was supported by NSF GRFP \(DGE\-1842487\)\. A\.L\. was supported by the SciLifeLab & Wallenberg Data Driven Life Science \(DDLS\) Program \(grant: KAW 2020\.0239\), the Swedish Research Council \(VR grant 2025\-06662\), and the Laboratory for Molecular Infection Medicine Sweden \(MIMS\) \(KAW 2023\.0159\)\. This research was enabled by resources provided by the National Academic Infrastructure for Supercomputing in Sweden \(NAISS\), partially funded by the Swedish Research Council through grant agreement no\. 2022\-06725\. A\.L\. thanks OpenEye Scientific Software for the use of OEToolkits at no cost\. The authors thank Grace Yin for the artistic illustration of the GLACIER icon and Elizabeth Fife, Defu Cao, Robert Winn, Mike Gee, Bryce Kan, and Chong Liu for their feedback on the manuscript\.

## GenAI disclosure

Gemini and ChatGPT were used to refine writing grammar and construct minor code snippets\. All outputs were reviewed and verified by the authors prior to inclusion\.

## References

- N\. S\. C\. at Linköping University \(2025\)Cited by:[Table 8](https://arxiv.org/html/2606.11382#A1.T8.4.14.10.2)\.
- D\. Bajusz, A\. Rácz, and K\. Héberger \(2015\)Why is tanimoto index an appropriate choice for fingerprint\-based similarity calculations?J Cheminform\.External Links:[Document](https://dx.doi.org/10.1186/s13321-015-0069-3)Cited by:[§A\.3](https://arxiv.org/html/2606.11382#A1.SS3.p1.1),[§4\.6](https://arxiv.org/html/2606.11382#S4.SS6.p1.2)\.
- E\. J\. Bjerrum \(2017\)SMILES enumeration as data augmentation for neural network modeling of molecules\.arXiv1703\.07076\.External Links:[Document](https://dx.doi.org/10.48550/arXiv.1703.07076)Cited by:[§A\.2](https://arxiv.org/html/2606.11382#A1.SS2.p1.2),[§4\.1](https://arxiv.org/html/2606.11382#S4.SS1.p2.1)\.
- J\. Burns, A\. S\. Zalte, and W\. Green \(2025\)Descriptor\-based foundation models for molecular property prediction\.arXiv2506\.15792\.External Links:[Document](https://dx.doi.org/10.48550/arXiv.2506.15792)Cited by:[§2\.1](https://arxiv.org/html/2606.11382#S2.SS1.p1.1),[§4\.3](https://arxiv.org/html/2606.11382#S4.SS3.p1.1)\.
- E\. Cartan \(1933\)Sur les espaces de finsler\.InComptes rendus de l’Académie des Sciences,Vol\.196,pp\. 582–586\.Cited by:[item 2](https://arxiv.org/html/2606.11382#S1.I1.i2.p1.1)\.
- ChemAxon \(2025\)Cited by:[§4\.1](https://arxiv.org/html/2606.11382#S4.SS1.p1.1)\.
- S\. Chithrananda, G\. Grand, and B\. Ramsundar \(2020\)ChemBERTa: large\-scale self\-supervised pretraining for molecular property prediction\.arXiv2010\.09885\.External Links:[Document](https://dx.doi.org/10.48550/arXiv.2010.09885)Cited by:[§1](https://arxiv.org/html/2606.11382#S1.p2.1),[§2\.1](https://arxiv.org/html/2606.11382#S2.SS1.p1.1)\.
- T\. Dagès, S\. N\. Weber, Y\. E\. Lin, R\. Talmon, D\. Cremers, M\. Lindenbaum, A\. M\. Bruckstein, and R\. Kimmel \(2025\)Finsler multi\-Dimensional Scaling: Manifold Learning for Asymmetric Dimensionality Reduction and Embedding\.Conference on Computer Vision and Pattern Recognition \(CVPR\),pp\. 25842–25853\.External Links:[Document](https://dx.doi.org/10.1109/CVPR52734.2025.02407)Cited by:[item 2](https://arxiv.org/html/2606.11382#S1.I1.i2.p1.1),[§3\.2](https://arxiv.org/html/2606.11382#S3.SS2.p2.3),[§6](https://arxiv.org/html/2606.11382#S6.p3.1)\.
- L\. David, A\. Thakkar, R\. Mercado, and O\. Engkvist \(2020\)Molecular representations in ai\-driven drug discovery: a review and practical guide\.J Cheminform\.12\.External Links:[Document](https://dx.doi.org/10.1186/s13321-020-00460-5)Cited by:[§2\.1](https://arxiv.org/html/2606.11382#S2.SS1.p1.1),[§2\.2](https://arxiv.org/html/2606.11382#S2.SS2.p1.1)\.
- F\. Ekström Kelvinius, D\. Georgiev, A\. Toshev, and J\. Gasteiger \(2023\)Accelerating molecular graph neural networks via knowledge distillation\.InAdvances in Neural Information Processing Systems,Vol\.36,pp\. 25761–25792\.External Links:[Link](https://openreview.net/forum?id=A18PgVSUgf)Cited by:[§2\.1](https://arxiv.org/html/2606.11382#S2.SS1.p1.1)\.
- Enamine \(2024\)Cited by:[§C\.1](https://arxiv.org/html/2606.11382#A3.SS1.p1.1),[§3\.1\.2](https://arxiv.org/html/2606.11382#S3.SS1.SSS2.p1.5),[§4\.1](https://arxiv.org/html/2606.11382#S4.SS1.p1.1)\.
- C\. Feiyang, K\. Zacour, Z\. Tianyu, T\. Tzuen\-Rong, D\. Yongping, L\. Ling, P\. Srikanth, L\. Gang, and L\. Feng \(2025\)ChemFM as a scaling law guided foundation model pre\-trained on informative chemicals\.Commun Chem\.9\.External Links:[Document](https://dx.doi.org/10.1038/s42004-025-01793-8)Cited by:[§1](https://arxiv.org/html/2606.11382#S1.p2.1),[§2\.1](https://arxiv.org/html/2606.11382#S2.SS1.p1.1),[§4\.3](https://arxiv.org/html/2606.11382#S4.SS3.p1.1)\.
- N\. C\. Frey, R\. Soklaski, S\. Axelrod, S\. Samsi, R\. Gómez\-Bombarelli, C\. W\. Coley, and V\. Gadepally \(2023\)Neural scaling of deep chemical models\.Nat Mach Intell5,pp\. 1297–1305\.External Links:[Document](https://dx.doi.org/10.1038/s42256-023-00740-3)Cited by:[§2\.1](https://arxiv.org/html/2606.11382#S2.SS1.p1.1),[§4\.3](https://arxiv.org/html/2606.11382#S4.SS3.p1.1)\.
- J\. Gilmer, S\. S\. Schoenholz, P\. F\. Riley, O\. Vinyals, and G\. E\. Dahl \(2017\)Neural message passing for quantum chemistry\.InInternational conference on machine learning,pp\. 1263–1272\.Cited by:[§3\.1\.1](https://arxiv.org/html/2606.11382#S3.SS1.SSS1.p1.3)\.
- A\. Gretton, O\. Bousquet, A\. Smola, and B\. Schölkopf \(2005\)Measuring statistical dependence with hilbert\-schmidt norms\.InAlgorithmic Learning Theory,pp\. 63–77\.External Links:[Document](https://dx.doi.org/10.1007/11564089%5F7)Cited by:[§C\.1](https://arxiv.org/html/2606.11382#A3.SS1.p1.1)\.
- O\. O\. Grygorenko, D\. S\. Radchenko, I\. Dziuba, A\. Chuprina, K\. E\. Gubina, and Y\. S\. Moroz \(2020\)Generating multibillion chemical space of readily accessible screening compounds\.iScience23\(11\),pp\. 101681\.External Links:ISSN 2589\-0042,[Document](https://dx.doi.org/10.1016/j.isci.2020.101681)Cited by:[§4\.1](https://arxiv.org/html/2606.11382#S4.SS1.p1.1)\.
- M\. Hay, D\. W\. Thomas, J\. L\. Craighead, C\. Economides, and J\. Rosenthal \(2014\)Clinical development success rates for investigational drugs\.Nat Biotechnol32,pp\. 40–51\.External Links:[Document](https://dx.doi.org/10.1038/nbt.2786)Cited by:[§1](https://arxiv.org/html/2606.11382#S1.p1.1)\.
- K\. Huang, T\. Fu, W\. Gao, Y\. Zhao, Y\. Roohani, J\. Leskovec, C\. W\. Coley, C\. Xiao, J\. Sun, and M\. Zitnik \(2021\)Therapeutics data commons: machine learning datasets and tasks for drug discovery and development\.Proceedings of Neural Information Processing Systems, NeurIPS Datasets and Benchmarks\.External Links:[Link](https://openreview.net/forum?id=8nvgnORnoWr)Cited by:[Table 11](https://arxiv.org/html/2606.11382#A2.T11.3.11.7.1.1.1.1),[§1](https://arxiv.org/html/2606.11382#S1.p2.1),[§4\.2](https://arxiv.org/html/2606.11382#S4.SS2.p1.1),[§4\.4](https://arxiv.org/html/2606.11382#S4.SS4.p1.1)\.
- X\. Ji, Z\. Wang, Z\. Gao, H\. Zheng, L\. Zhang, G\. Ke, and W\. E \(2024\)Exploring molecular pretraining model at scale\.InAdvances in Neural Information Processing Systems,Vol\.37,pp\. 46956–46978\.External Links:[Link](https://openreview.net/forum?id=64V40K2fDv)Cited by:[§2\.1](https://arxiv.org/html/2606.11382#S2.SS1.p1.1)\.
- F\. Jiang, M\. Prakash, H\. Ma, J\. Deng, Y\. Guo, A\. Mollaysa, T\. Mansi, R\. Liao, and J\. Huang \(2026\)TRIDENT: tri\-modal molecular representation learning with taxonomic annotations and local correspondence\.InAdvances in Neural Information Processing Systems,Vol\.38,pp\. 174391–174419\.External Links:[Link](https://openreview.net/forum?id=M6l3pyvUfr)Cited by:[§4\.4](https://arxiv.org/html/2606.11382#S4.SS4.p1.1)\.
- J\. Kaplan, S\. McCandlish, T\. Henighan, T\. B\. Brown, B\. Chess, R\. Child, S\. Gray, A\. Radford, J\. Wu, and D\. Amodei \(2020\)Scaling laws for neural language models\.arXiv2001\.08361\.External Links:[Document](https://dx.doi.org/10.48550/arXiv.2001.08361)Cited by:[§4\.7](https://arxiv.org/html/2606.11382#S4.SS7.p3.1)\.
- B\. Kaufman, E\. C\. Williams, C\. Underkoffler, R\. Pederson, N\. Mardirossian, I\. Watson, and J\. Parkhill \(2024\)COATI: multimodal contrastive pretraining for representing and traversing chemical space\.J Chem Inf Model\.64\(4\),pp\. 1145–1157\.External Links:[Document](https://dx.doi.org/10.1021/acs.jcim.3c01753)Cited by:[§2\.2](https://arxiv.org/html/2606.11382#S2.SS2.p1.1),[§4\.3](https://arxiv.org/html/2606.11382#S4.SS3.p1.1)\.
- F\. D\. Keles, P\. M\. Wijewardena, and C\. Hegde \(2023\)On the computational complexity of self\-attention\.InProceedings of The 34th International Conference on Algorithmic Learning Theory,Proceedings of Machine Learning Research, Vol\.201,pp\. 597–619\.Cited by:[§2\.1](https://arxiv.org/html/2606.11382#S2.SS1.p1.1)\.
- K\. Kläser, B\. Banaszewski, S\. Maddrell\-Mander, C\. McLean, L\. Müller, A\. Parviz, S\. Huang, and A\. W\. Fitzgibbon \(2024\)MiniMol: a parameter\-efficient foundation model for molecular learning\.arXiv2404\.14986\.External Links:[Document](https://dx.doi.org/10.48550/arXiv.2404.14986)Cited by:[§1](https://arxiv.org/html/2606.11382#S1.p3.1),[§2\.1](https://arxiv.org/html/2606.11382#S2.SS1.p1.1),[§4\.3](https://arxiv.org/html/2606.11382#S4.SS3.p1.1),[§4\.4](https://arxiv.org/html/2606.11382#S4.SS4.p1.1)\.
- S\. Kornblith, M\. Norouzi, H\. Lee, and G\. Hinton \(2019\)Similarity of neural network representations revisited\.InProceedings of the 36th International Conference on Machine Learning,K\. Chaudhuri and R\. Salakhutdinov \(Eds\.\),Proceedings of Machine Learning Research, Vol\.97,pp\. 3519–3529\.Cited by:[§C\.1](https://arxiv.org/html/2606.11382#A3.SS1.p1.1)\.
- A\. Krishnan, M\. N\. Anahtar, J\. A\. Valeri, W\. Jin, N\. M\. Donghia, L\. Sieben, A\. Luttens, Y\. Zhang, S\. M\. Modaresi, A\. Hennes, J\. Fromer, P\. Bandyopadhyay, J\. C\. Chen, D\. Rehman, R\. Desai, P\. Edwards, R\. S\. Lach, M\. Aschtgen, M\. Gaborieau, M\. Gaetani, S\. G\. Palace, O\. Satotaka, K\. Lutete, M\. Y\. S\., B\. Bruce, C\. Jin, E\. Loh, G\. Y\. H\., S\. A\. A\., C\. C\. W\., W\. Felix, and J\. J\. Collins \(2025\)A generative deep learning approach to de novo antibiotic design\.Cell188,pp\. 5962–5979\.External Links:[Document](https://dx.doi.org/10.1016/j.cell.2025.07.033)Cited by:[§1](https://arxiv.org/html/2606.11382#S1.p1.1)\.
- C\. E\. Lee, J\. S\. Kim, J\. H\. Min, and S\. W\. Han \(2025\)SimSon: simple contrastive learning of smiles for molecular property prediction\.Bioinformatics41\(5\),pp\. btaf275\.External Links:[Document](https://dx.doi.org/10.1093/bioinformatics/btaf275)Cited by:[§2\.1](https://arxiv.org/html/2606.11382#S2.SS1.p1.1)\.
- J\. Li and X\. Jiang \(2021\)Mol\-bert: an effective molecular representation with bert for molecular property prediction\.Wireless Communications and Mobile Computing2021\(1\),pp\. 7181815\.External Links:[Document](https://dx.doi.org/10.1155/2021/7181815)Cited by:[§2\.1](https://arxiv.org/html/2606.11382#S2.SS1.p1.1)\.
- Y\. Li, Y\. Fang, M\. Zhang, and C\. Shi \(2025\)Advancing molecular graph\-text pre\-training via fine\-grained alignment\.InProceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V\.2,pp\. 1589–1599\.External Links:[Document](https://dx.doi.org/10.1145/3711896.3736834)Cited by:[§2\.2](https://arxiv.org/html/2606.11382#S2.SS2.p1.1)\.
- P\. Liu, Y\. Ren, J\. Tao, and Z\. Ren \(2024\)Git\-mol: a multi\-modal large language model for molecular science with graph, image, and text\.Comput Biol Med\.,pp\. 108073\.External Links:[Document](https://dx.doi.org/10.1016/j.compbiomed.2024.108073)Cited by:[§2\.2](https://arxiv.org/html/2606.11382#S2.SS2.p1.1),[§4\.3](https://arxiv.org/html/2606.11382#S4.SS3.p1.1)\.
- S\. Liu, H\. Wang, W\. Liu, J\. Lasenby, H\. Guo, and J\. Tang \(2021\)Pre\-training molecular graph representation with 3d geometry\.arXiv2110\.07728\.External Links:[Document](https://dx.doi.org/10.48550/arXiv.2110.07728)Cited by:[§2\.1](https://arxiv.org/html/2606.11382#S2.SS1.p1.1)\.
- K\. Luong and A\. K\. Singh \(2023\)Fragment\-based pretraining and finetuning on molecular graphs\.Advances in Neural Information Processing Systems36,pp\. 17584–17601\.Cited by:[§2\.1](https://arxiv.org/html/2606.11382#S2.SS1.p1.1)\.
- A\. Luttens, I\. Cabeza de Vaca, L\. Sparring, J\. Brea, A\. L\. Martínez, N\. A\. Kahlous, D\. Radchenko, Y\. Moroz, M\. I\. Loza, U\. Norinder, and J\. Carlsson \(2025\)Rapid traversal of vast chemical space using machine learning\-guided docking screens\.Nat Comput Sci\.5,pp\. 301–312\.External Links:[Document](https://dx.doi.org/10.1038/s43588-025-00777-x)Cited by:[§1](https://arxiv.org/html/2606.11382#S1.p1.1)\.
- Ł\. Maziarka, D\. Majchrowski, T\. Danel, P\. Gaiński, J\. Tabor, I\. Podolak, P\. Morkisz, and S\. Jastrzębski \(2021\)Relative molecule self\-attention transformer\.J Cheminform\.16\.External Links:[Document](https://dx.doi.org/10.1186/s13321-023-00789-7)Cited by:[§2\.1](https://arxiv.org/html/2606.11382#S2.SS1.p1.1),[§4\.3](https://arxiv.org/html/2606.11382#S4.SS3.p1.1)\.
- P\. Nakkiran, G\. Kaplun, Y\. Bansal, T\. Yang, B\. Barak, and I\. Sutskever \(2021\)Deep double descent: where bigger models and more data hurt\.J\. Stat\. Mech\.: Theory Exp\.2021\(12\),pp\. 124003\.External Links:[Document](https://dx.doi.org/10.1088/1742-5468/ac3a74)Cited by:[§4\.7](https://arxiv.org/html/2606.11382#S4.SS7.p3.1)\.
- A\. v\. d\. Oord, Y\. Li, and O\. Vinyals \(2018\)Representation learning with contrastive predictive coding\.arXiv1807\.03748\.External Links:[Document](https://dx.doi.org/10.48550/arXiv.1807.03748)Cited by:[§3\.3\.2](https://arxiv.org/html/2606.11382#S3.SS3.SSS2.p1.4)\.
- M\. Praski, J\. Adamczyk, and W\. Czech \(2025\)Benchmarking pretrained molecular embedding models for molecular representation learning\.arXiv2508\.06199\.External Links:[Document](https://dx.doi.org/10.48550/arXiv.2508.06199)Cited by:[§2\.1](https://arxiv.org/html/2606.11382#S2.SS1.p1.1),[§4\.4](https://arxiv.org/html/2606.11382#S4.SS4.p1.1)\.
- Z\. Qiu, Z\. Wang, B\. Zheng, Z\. Huang, K\. Wen, S\. Yang, R\. Men, L\. Yu, F\. Huang, S\. Huang, D\. Liu, J\. Zhou, and J\. Lin \(2025\)Gated attention for large language models: non\-linearity, sparsity, and attention\-sink\-free\.arXiv\.External Links:[Document](https://dx.doi.org/10.48550/arXiv.2505.06708)Cited by:[§3\.2](https://arxiv.org/html/2606.11382#S3.SS2.p6.2)\.
- A\. Radford, J\. W\. Kim, C\. Hallacy, A\. Ramesh, G\. Goh, S\. Agarwal, G\. Sastry, A\. Askell, P\. Mishkin, J\. Clark, G\. Krueger, and I\. Sutskever \(2021\)Learning transferable visual models from natural language supervision\.InProceedings of the 38th International Conference on Machine Learning,Proceedings of Machine Learning Research, Vol\.139,pp\. 8748–8763\.Cited by:[§2\.2](https://arxiv.org/html/2606.11382#S2.SS2.p1.1)\.
- G\. Randers \(1941\)On an asymmetrical metric in the four\-space of general relativity\.Phys\. Rev\.59,pp\. 195–199\.External Links:[Document](https://dx.doi.org/10.1103/PhysRev.59.195),[Link](https://link.aps.org/doi/10.1103/PhysRev.59.195)Cited by:[§3\.2](https://arxiv.org/html/2606.11382#S3.SS2.p2.3)\.
- RDKit \(2025\)Cited by:[§A\.4](https://arxiv.org/html/2606.11382#A1.SS4.p1.1),[§3\.1\.3](https://arxiv.org/html/2606.11382#S3.SS1.SSS3.p1.1),[§4\.4](https://arxiv.org/html/2606.11382#S4.SS4.p1.1)\.
- Y\. Rong, Y\. Bian, T\. Xu, W\. Xie, Y\. Wei, W\. Huang, and J\. Huang \(2020\)Self\-supervised graph transformer on large\-scale molecular data\.InAdvances in Neural Information Processing Systems,Vol\.33,pp\. 12559–12571\.Cited by:[§1](https://arxiv.org/html/2606.11382#S1.p3.1),[§2\.1](https://arxiv.org/html/2606.11382#S2.SS1.p1.1)\.
- J\. Ross, B\. M\. Belgodere, V\. Chenthamarakshan, I\. Padhi, Y\. Mroueh, and P\. Das \(2021\)Large\-scale chemical language representations capture molecular structure and properties\.Nat Mach Intell4,pp\. 1256–1264\.External Links:[Document](https://dx.doi.org/10.1038/s42256-022-00580-7)Cited by:[§1](https://arxiv.org/html/2606.11382#S1.p2.1),[§1](https://arxiv.org/html/2606.11382#S1.p3.1),[§2\.1](https://arxiv.org/html/2606.11382#S2.SS1.p1.1),[§4\.3](https://arxiv.org/html/2606.11382#S4.SS3.p1.1)\.
- R\. Singh, A\. A\. Barsainyan, R\. Irfan, C\. J\. Amorin, S\. He, T\. Davis, A\. P\. Thiagarajan, S\. Sankaran, S\. Chithrananda, W\. Aḥmad, D\. Jones, K\. S\. McLoughlin, H\. Kim, A\. Bhutani, S\. V\. Sathyanarayana, V\. Viswanathan, J\. E\. Allen, and B\. Ramsundar \(2026\)ChemBERTa\-3: an open source training framework for chemical foundation models\.Digital Discovery5,pp\. 662–685\.External Links:[Document](https://dx.doi.org/10.1039/D5DD00348B)Cited by:[§2\.1](https://arxiv.org/html/2606.11382#S2.SS1.p1.1),[§4\.3](https://arxiv.org/html/2606.11382#S4.SS3.p1.1)\.
- J\. M\. Stokes, K\. Yang, K\. Swanson, W\. Jin, A\. Cubillos\-Ruiz, N\. M\. Donghia, C\. R\. MacNair, S\. French, L\. A\. Carfrae, Z\. Bloom\-Ackermann, V\. M\. Tran, A\. Chiappino\-Pepe, A\. H\. Badran, I\. W\. Andrews, E\. J\. Chory, G\. M\. Church, E\. D\. Brown, T\. S\. Jaakkola, R\. Barzilay, and J\. J\. Collins \(2020\)A deep learning approach to antibiotic discovery\.Cell180,pp\. 688–702\.External Links:[Document](https://dx.doi.org/10.1016/j.cell.2020.01.021)Cited by:[§1](https://arxiv.org/html/2606.11382#S1.p1.1)\.
- K\. Swanson, P\. Walther, J\. Leitz, S\. Mukherjee, J\. C\. Wu, R\. V\. Shivnaraine, and J\. Zou \(2024\)ADMET\-AI: a machine learning admet platform for evaluation of large\-scale chemical libraries\.Bioinformatics40\.External Links:[Document](https://dx.doi.org/10.1093/bioinformatics/btae416)Cited by:[§1](https://arxiv.org/html/2606.11382#S1.p1.1)\.
- J\. Vamathevan, D\. Clark, P\. Czodrowski, I\. Dunham, E\. Ferran, G\. Lee, B\. Li, A\. Madabhushi, P\. K\. Shah, M\. Spitzer, and S\. Zhao \(2019\)Applications of machine learning in drug discovery and development\.Nat Rev Drug Discov\.18,pp\. 463–477\.External Links:[Document](https://dx.doi.org/10.1038/s41573-019-0024-5)Cited by:[§1](https://arxiv.org/html/2606.11382#S1.p1.1)\.
- L\. Van der Maaten and G\. Hinton \(2008\)Visualizing data using t\-SNE\.Journal of Machine Learning Research9\(86\),pp\. 2579–2605\.External Links:[Link](http://jmlr.org/papers/v9/vandermaaten08a.html)Cited by:[§4\.6](https://arxiv.org/html/2606.11382#S4.SS6.p2.1)\.
- A\. Vaswani, N\. Shazeer, N\. Parmar, J\. Uszkoreit, L\. Jones, A\. N\. Gomez, Ł\. Kaiser, and I\. Polosukhin \(2017\)Attention is all you need\.InAdvances in Neural Information Processing Systems,Vol\.30\.Cited by:[§2\.1](https://arxiv.org/html/2606.11382#S2.SS1.p1.1)\.
- D\. Weininger \(1988\)SMILES, a chemical language and information system\. 1\. introduction to methodology and encoding rules\.J\. Chem\. Inf\. Comput\. Sci\.28\(1\),pp\. 31–36\.External Links:[Document](https://dx.doi.org/10.1021/ci00057a005)Cited by:[item 2](https://arxiv.org/html/2606.11382#S1.I1.i2.p1.1)\.
- O\. J\. Wouters, M\. Mckee, and J\. Luyten \(2020\)Estimated research and development investment needed to bring a new medicine to market, 2009\-2018\.JAMA323,pp\. 844–853\.External Links:[Document](https://dx.doi.org/10.1001/jama.2022.14317)Cited by:[§1](https://arxiv.org/html/2606.11382#S1.p1.1)\.
- Z\. Wu, B\. Ramsundar, E\. N\. Feinberg, J\. Gomes, C\. Geniesse, A\. S\. Pappu, K\. Leswing, and V\. S\. Pande \(2017\)MoleculeNet: a benchmark for molecular machine learning\.Chem Sci\.9,pp\. 513–530\.External Links:[Document](https://dx.doi.org/10.1039/c7sc02664a)Cited by:[§B\.4](https://arxiv.org/html/2606.11382#A2.SS4.p1.1),[Table 11](https://arxiv.org/html/2606.11382#A2.T11.3.5.1.1.1.1.1),[§1](https://arxiv.org/html/2606.11382#S1.p2.1),[§4\.2](https://arxiv.org/html/2606.11382#S4.SS2.p1.1),[§4\.4](https://arxiv.org/html/2606.11382#S4.SS4.p1.1)\.
- K\. Yang, K\. Swanson, W\. Jin, C\. Coley, P\. Eiden, H\. Gao, A\. Guzman\-Perez, T\. Hopper, B\. Kelley, M\. Mathea, A\. Palmer, V\. Settels, T\. Jaakkola, K\. Jensen, and R\. Barzilay \(2019\)Analyzing learned molecular representations for property prediction\.J Chem Inf Model\.59,pp\. 3370–3388\.External Links:[Document](https://dx.doi.org/10.1021/acs.jcim.9b00237)Cited by:[§1](https://arxiv.org/html/2606.11382#S1.p1.1)\.
- Z\. Zeng, Y\. Yao, Z\. Liu, and M\. Sun \(2022\)A deep\-learning system bridging molecule structure and biomedical text with comprehension comparable to human professionals\.Nat Commun13\.External Links:[Document](https://dx.doi.org/10.1038/s41467-022-28494-3)Cited by:[§1](https://arxiv.org/html/2606.11382#S1.p3.1)\.
- G\. Zhou, S\. Janarthanan, Y\. Lu, and P\. Hu \(2025\)CL\-MFAP: a contrastive learning\-based multimodal foundation model for molecular property prediction and antibiotic screening\.InThe Thirteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=fv9XU7CyN2)Cited by:[§2\.2](https://arxiv.org/html/2606.11382#S2.SS2.p1.1),[§4\.3](https://arxiv.org/html/2606.11382#S4.SS3.p1.1)\.
- G\. Zhou, Z\. Gao, Q\. Ding, H\. Zheng, H\. Xu, Z\. Wei, L\. Zhang, and G\. Ke \(2023\)Uni\-mol: a universal 3d molecular representation learning framework\.InThe Eleventh International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=6K2RM6wVqKu)Cited by:[§1](https://arxiv.org/html/2606.11382#S1.p3.1),[§2\.1](https://arxiv.org/html/2606.11382#S2.SS1.p1.1)\.

## Appendix A Implementation details

### A.1. Model configuration

The details of GLACIER's architecture are presented in Table [7](https://arxiv.org/html/2606.11382#A1.T7).

Table 7. GLACIER Model Architecture

| Component | Subcomponent | Configuration |
| --- | --- | --- |
| Graph Encoder | Message Passing Steps ($K$) | 3 |
| | Output Dimension | 300 |
| | Readout Mechanism | Attentive Aggregation |
| Text Encoder | Transformer Layers ($N$) | 2 |
| | Heads | 8 |
| | Hidden Dimension ($d_{text}$) | 128 |
| | Max Sequence Length ($L$) | 512 |
| | BPE Vocabulary Size ($V$) | 8,000 |
| Tabular Encoder | Input Feature Dimension | 217 |
| Fusion | Modality Projections | 3-layer MLP |
| | Geometry Parameters | $\alpha$, $\lambda$, $\boldsymbol{\omega}$ |
| Distillation | Teacher Projections | 2-layer MLP |
| | Internal Activations | GELU |

### A.2. Training dynamics and hardware

We optimized the network using AdamW with a uniform weight decay of 0.01 across all modules and a cosine learning rate scheduler with warmup. Module-specific learning rates were set to $3\times 10^{-4}$ for the text encoder and $1\times 10^{-3}$ for the graph encoder, tabular encoder, and fusion components. To prevent overfitting and encourage robust multimodal learning, we applied a dropout of 0.1 and SMILES data augmentation \[Bjerrum, [2017](https://arxiv.org/html/2606.11382#bib.bib58)\]. Pretraining and downstream inference were performed with a batch size of 1024 on a workstation equipped with an Intel Core i9-13900HX processor, 32 GB of system RAM, and a single NVIDIA GeForce RTX 4080 GPU (12 GB VRAM). Details on the pretraining setup and hardware are presented in Table [8](https://arxiv.org/html/2606.11382#A1.T8).

Table 8. GLACIER Training Dynamics and Hardware

| Component | Parameter | Configuration |
| --- | --- | --- |
| Optimization | Optimizer | AdamW |
| | Scheduler | Cosine with Warmup |
| | Batch Size | 1024 |
| | Max Epochs | 250 |
| | Weight Decay | 0.01 |
| | Graph / Tabular / Fusion LR | $1\times 10^{-3}$ |
| | Text LR | $3\times 10^{-4}$ |
| Loss | InfoNCE Temperature ($\tau$) | 0.07 |
| | Min. Contribution Floor ($\epsilon$) | 0.1 |
| Regularization | Fusion Modality Dropout | 0.1 |
| | Augmentation | SMILES Canonicalization |
| Hardware | Local Hardware | 1x RTX 4080 GPU |
| | NAISS Hardware \[at Linköping University, [2025](https://arxiv.org/html/2606.11382#bib.bib57)\] | NVIDIA Tesla T4 GPU |

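
The schedule named in this section (cosine decay with warmup, using the module-specific peak learning rates listed above) can be sketched as follows. This is only an illustration: the warmup shape (linear), the warmup length, the total step count, and the minimum learning rate are not stated in the text and are assumptions here, as are the function names.

```python
import math

def cosine_with_warmup(step, total_steps, warmup_steps, peak_lr, min_lr=0.0):
    """Learning rate at `step`: linear warmup (assumed), then cosine decay."""
    if step < warmup_steps:
        # Linear ramp up to peak_lr over the warmup phase.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clipped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Peak learning rates per module, as reported in this section.
PEAK_LR = {"text": 3e-4, "graph": 1e-3, "tabular": 1e-3, "fusion": 1e-3}

def lrs_at(step, total_steps, warmup_steps):
    """Per-module learning rates at a given optimizer step."""
    return {
        name: cosine_with_warmup(step, total_steps, warmup_steps, lr)
        for name, lr in PEAK_LR.items()
    }

if __name__ == "__main__":
    # Hypothetical schedule length, chosen only for the demonstration.
    for step in (0, 50, 500, 1000):
        print(step, lrs_at(step, total_steps=1000, warmup_steps=100))
```

In a training framework these values would typically be attached to separate optimizer parameter groups, one per module.
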
### A.3. Similarity between training datasets

To assess the degree of structural overlap between pretraining and downstream datasets, we measured the similarity between benchmark molecules and molecules from each model's pretraining corpus. For GLACIER, we used all 100,000 pretraining molecules, while for models with publicly available pretraining data (e.g., Git-Mol and MolFormer), we randomly sampled 100,000 molecules. For each benchmark molecule, we computed the maximum nearest-neighbor Tanimoto similarity to any molecule in the corresponding pretraining subset using Morgan fingerprints (radius = 2, 1024 bits) generated with RDKit (version 2025.09.3), where Tanimoto similarity corresponds to the Jaccard index between fingerprint bit vectors \[Bajusz et al., [2015](https://arxiv.org/html/2606.11382#bib.bib47)\]. We then averaged these maximum similarities across each benchmark dataset to quantify its structural overlap with the pretraining corpus. As shown in Figure [6](https://arxiv.org/html/2606.11382#A1.F6), GLACIER was pretrained on molecules that are structurally distinct from those in the downstream benchmarks, with average maximum Tanimoto similarities of at most 0.35. While indirect exposure through the teacher models cannot be ruled out, these results suggest minimal direct overlap between GLACIER's pretraining data and the evaluation datasets. In contrast, Git-Mol and MolFormer exhibit substantially higher overlap, with average maximum similarities exceeding 0.70 on 7 of the 11 benchmarks. This indicates that molecules in their pretraining corpora are often highly similar to those in downstream datasets, potentially conferring an advantage during transfer learning.

![Refer to caption](https://arxiv.org/html/2606.11382v1/fig6.png)

Figure 6. Distribution of Tanimoto similarity scores between pretraining and benchmark datasets. (Left) Nearest-neighbor Tanimoto similarity distribution for the AMES dataset. (Right) Distribution of dataset-wide average Tanimoto similarity scores across all 11 evaluation benchmarks. Horizontal lines within the boxes denote the median value, while the outer boundaries outline the interquartile range (IQR).

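
The overlap statistic described in this subsection (the maximum nearest-neighbor Tanimoto similarity per benchmark molecule, averaged over the dataset) can be sketched in a few lines. To keep the example self-contained, fingerprints are represented as plain sets of on-bit indices rather than RDKit bit vectors; the toy fingerprints and the function names are hypothetical, and the value returned for two empty fingerprints is a convention chosen here.

```python
def tanimoto(a, b):
    """Jaccard index between two fingerprints given as sets of on-bit indices."""
    if not a and not b:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(a | b)

def max_nn_similarity(query, corpus):
    """Highest Tanimoto similarity between one molecule and any corpus molecule."""
    return max(tanimoto(query, ref) for ref in corpus)

def dataset_overlap(benchmark, corpus):
    """Average of the per-molecule maximum similarities over a benchmark."""
    scores = [max_nn_similarity(fp, corpus) for fp in benchmark]
    return sum(scores) / len(scores)

if __name__ == "__main__":
    # Toy fingerprints (hypothetical bit indices, not real molecules).
    pretraining = [{1, 2, 3, 4}, {10, 11, 12}]
    benchmark = [{1, 2, 3}, {10, 20}]
    print(dataset_overlap(benchmark, pretraining))
```

With real data, the sets would be replaced by the on-bits of 1024-bit Morgan fingerprints of radius 2, as described above.
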
### A.4. Description of tabular data

We used the 217 descriptors computed by RDKit (version 2025.09.3) \[RDKit, [2025](https://arxiv.org/html/2606.11382#bib.bib55)\]. These descriptors include molecular properties such as molecular weight, logP, and the number of hydrogen bond donors and acceptors, as described in Table [9](https://arxiv.org/html/2606.11382#A1.T9).

Table 9. Physicochemical descriptors used as the tabular modality input $\mathbf{x}_{\text{tab}}\in\mathbb{R}^{217}$ in GLACIER, computed by RDKit.

## Appendix B Evaluation tasks

We evaluated the proposed GLACIER framework on various classification and regression tasks to assess its performance and applicability scope.

### B.1. Classification metrics

To quantify the models' performance on binary and multi-label classification tasks, we utilized the Area Under the ROC Curve (AUROC) metric, which measures the discriminative ability and is calculated as the area under the True Positive Rate (TPR) versus False Positive Rate (FPR) curve:

$$\text{AUROC}=\int_{0}^{1}\text{TPR}\left(\text{FPR}^{-1}(t)\right)\,dt \tag{11}$$
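
For a finite set of predictions, the integral in Equation 11 equals the fraction of positive-negative pairs in which the positive example receives the higher score, with ties counted as one half. The sketch below uses that pairwise form; it is a plain-Python illustration rather than the evaluation code used in the paper, and the function name is an assumption.

```python
def auroc(y_true, y_score):
    """AUROC via the pairwise (rank) formulation; ties count as half a win."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

if __name__ == "__main__":
    labels = [0, 0, 1, 1]
    scores = [0.1, 0.4, 0.35, 0.8]
    print(auroc(labels, scores))
```

The double loop is quadratic in the dataset size; a sort-based implementation gives the same value more efficiently on large benchmarks.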

### B.2. Regression metrics

To quantify the models' performance on regression property prediction tasks, we utilized the Root Mean Squared Error (RMSE) metric, which measures the square root of the average squared differences between predicted and actual values, heavily penalizing larger errors:

$$\text{RMSE}=\sqrt{\frac{1}{N}\sum_{i=1}^{N}(y_{i}-\hat{y}_{i})^{2}} \tag{12}$$

where $y_{i}$ is the ground truth, $\hat{y}_{i}$ is the predicted value, and $N$ is the number of samples.
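
Equation 12 translates directly into code. The following is a minimal sketch with a hypothetical function name.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between ground-truth and predicted values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(y - p) ** 2 for y, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))

if __name__ == "__main__":
    # A single large error dominates the score, as noted above.
    print(rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
```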

### B.3. Robustness metric

To assess the models' performance consistency, we report empirical means with their corresponding standard deviations (STDEV):

$$\text{STDEV}=\sqrt{\frac{1}{S-1}\sum_{s=1}^{S}(m_{s}-\bar{m})^{2}} \tag{13}$$

where $S=3$ is the total number of scaffold splits, $m_{s}$ represents the evaluation metric result for the $s$-th split, and $\bar{m}$ denotes the mean metric across all splits.
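
Equation 13 is the sample standard deviation with the $S-1$ denominator, which is what `statistics.stdev` in Python's standard library computes. The sketch below shows how a "mean ± STDEV" entry could be produced from per-split metrics; the metric values in the demonstration are made up.

```python
import statistics

def summarize(metric_per_split):
    """Return (mean, sample standard deviation) over scaffold splits."""
    mean = statistics.mean(metric_per_split)
    std = statistics.stdev(metric_per_split)  # divides by S - 1
    return mean, std

if __name__ == "__main__":
    # Hypothetical AUROC values from S = 3 splits.
    mean, std = summarize([0.80, 0.82, 0.84])
    print(f"{mean:.3f} ± {std:.3f}")
```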

### B.4. Scaffold splits

To provide a more realistic assessment of model generalization to unseen chemical structures, we evaluated all methods using scaffold-based data splits \[Wu et al., [2017](https://arxiv.org/html/2606.11382#bib.bib19)\]. For a fair comparison, all models are trained and evaluated using identical scaffold splits and random seeds. Because scaffold splitting enforces structural dissimilarity between training and test molecules, it is substantially more challenging than random splitting and can introduce considerable performance variability, particularly on smaller datasets where the number of unique scaffolds is limited. In this context, we observed high variance for certain baselines (e.g., 0.787 ± 0.121 AUROC for MiniMol on the E-Sub dataset). Importantly, elevated standard deviations are not observed consistently across all models or datasets, suggesting that this effect is dataset- and model-dependent.
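
A scaffold split keeps all molecules that share a scaffold in the same partition. The sketch below illustrates the idea on precomputed scaffold identifiers (for example, Bemis-Murcko scaffold SMILES). The exact splitting procedure used in the paper is not described here, so the seeded shuffling of scaffold groups, the 80/10/10 fractions, and the function name are assumptions.

```python
import random
from collections import defaultdict

def scaffold_split(scaffolds, frac_train=0.8, frac_valid=0.1, seed=0):
    """Assign whole scaffold groups to train/valid/test index lists."""
    groups = defaultdict(list)
    for idx, scaffold in enumerate(scaffolds):
        groups[scaffold].append(idx)

    group_list = list(groups.values())
    random.Random(seed).shuffle(group_list)  # seed controls which split is drawn

    n = len(scaffolds)
    train, valid, test = [], [], []
    for group in group_list:
        # A group is never divided, so no scaffold appears in two partitions.
        if len(train) + len(group) <= frac_train * n:
            train.extend(group)
        elif len(valid) + len(group) <= frac_valid * n:
            valid.extend(group)
        else:
            test.extend(group)
    return train, valid, test

if __name__ == "__main__":
    # Hypothetical scaffold labels for ten molecules.
    labels = ["A"] * 6 + ["B"] * 2 + ["C", "D"]
    print(scaffold_split(labels, seed=1))
```

Because groups are indivisible, the realized partition sizes only approximate the requested fractions, which is one reason small datasets with few unique scaffolds show larger variance across splits.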

### B.5. Benchmark dataset details

A numerical overview of the benchmark datasets is provided in Table [10](https://arxiv.org/html/2606.11382#A2.T10). Descriptions of each dataset are provided in Table [11](https://arxiv.org/html/2606.11382#A2.T11).

Table 10. Benchmark dataset statistics with molecular counts and class distribution.

| Dataset | Benchmark | Task | # Cmpds | Positive % |
| --- | --- | --- | --- | --- |
| AMES | TDC | Class | 7,255 | 54.46 |
| BBB | TDC | Class | 1,972 | 76.01 |
| E-Sub | TDC | Class | 664 | 28.77 |
| E-Inh | TDC | Class | 13,104 | 19.13 |
| PAMPA | TDC | Class | 2,034 | 85.50 |
| Pgp | TDC | Class | 1,212 | 53.38 |
| hERG | TDC | Class | 13,192 | 49.89 |
| Tox21 | MoleculeNet | Class | 7,730 | 3.93 |
| ToxCast | MoleculeNet | Class | 8,250 | 5.19 |
| ESOL | MoleculeNet | Regr | 1,117 | – |
| LIPO | MoleculeNet | Regr | 4,200 | – |

Table 11. Benchmark MoleculeNet and TDC Datasets Details

## Appendix C Additional results

### C.1. Selection of teacher models

Because GLACIER relies on knowledge distillation, its performance is inherently influenced by the choice of teacher models. To identify complementary teachers that provide diverse supervisory signals, we analyzed the similarity of representations produced by candidate teacher models using Centered Kernel Alignment (CKA) \[Kornblith et al., [2019](https://arxiv.org/html/2606.11382#bib.bib53)\]. Specifically, we computed the linear CKA between final-layer embedding matrices generated from 100,000 randomly selected molecules from the Enamine REAL database (65 billion compounds, version 2024.07) \[Enamine, [2024](https://arxiv.org/html/2606.11382#bib.bib20)\]. CKA measures the similarity between representation spaces via the normalized Hilbert-Schmidt Independence Criterion (HSIC) and is invariant to orthogonal transformations and isotropic scaling \[Gretton et al., [2005](https://arxiv.org/html/2606.11382#bib.bib54), Kornblith et al., [2019](https://arxiv.org/html/2606.11382#bib.bib53)\]. To maximize the diversity of distilled knowledge, we sought teacher pairs with limited representational overlap, avoiding models with highly similar embedding spaces (e.g., CKA > 0.80). Based on this analysis, we selected MiniMol and MolFormer, which exhibit a moderate CKA similarity of 0.48, indicating that they capture different aspects of molecular structure. In addition to their complementary representations, both models are well-established molecular foundation models, making them suitable choices for investigating multi-teacher distillation.
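
Linear CKA between two embedding matrices $X\in\mathbb{R}^{n\times p}$ and $Y\in\mathbb{R}^{n\times q}$ with centered columns is $\lVert Y^{\top}X\rVert_F^{2} / (\lVert X^{\top}X\rVert_F\,\lVert Y^{\top}Y\rVert_F)$. The dependency-free sketch below illustrates this formula and its invariance to rotation and isotropic scaling on a tiny made-up example; it is not the authors' code, and at the scale used above a vectorized implementation would be needed.

```python
import math

def center_columns(m):
    """Subtract the column mean from every entry of a row-major matrix."""
    n = len(m)
    means = [sum(col) / n for col in zip(*m)]
    return [[v - mu for v, mu in zip(row, means)] for row in m]

def cross_gram_sq_norm(a, b):
    """Squared Frobenius norm of a^T b for row-major a (n x p) and b (n x q)."""
    total = 0.0
    for col_a in zip(*a):
        for col_b in zip(*b):
            dot = sum(x * y for x, y in zip(col_a, col_b))
            total += dot * dot
    return total

def linear_cka(x, y):
    """Linear CKA between two embedding matrices with the same number of rows."""
    xc, yc = center_columns(x), center_columns(y)
    numerator = cross_gram_sq_norm(yc, xc)
    denominator = math.sqrt(cross_gram_sq_norm(xc, xc) * cross_gram_sq_norm(yc, yc))
    return numerator / denominator  # undefined if an input has no variance

if __name__ == "__main__":
    # Hypothetical 4 x 2 embeddings; y is x rotated by 90 degrees and scaled by 2.
    x = [[1.0, 2.0], [3.0, 1.0], [0.0, 4.0], [2.0, 2.0]]
    y = [[-2.0 * b, 2.0 * a] for a, b in x]
    print(linear_cka(x, y))
```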

### C.2. Latency evaluation

To ensure a fair comparison of inference efficiency, we measured the average per-molecule forward-pass latency using `perf_counter()` from Python's `time` library on the workstation described in Table [8](https://arxiv.org/html/2606.11382#A1.T8). Measurements were performed with a batch size of one to quantify per-molecule inference cost independent of batching effects. To isolate model execution time, data loading, tokenization, feature generation, and other preprocessing operations were excluded. For Transformer-based models, dynamic sequence padding was employed to avoid unnecessary computation on padding tokens and provide representative latency estimates.
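
A measurement loop of the kind described above could look like the following sketch. Inputs are assumed to be preprocessed already, so only the forward call is timed. The warmup passes, the `forward` callable, and the function name are assumptions made for illustration.

```python
import time

def mean_latency_ms(forward, inputs, warmup=3):
    """Average per-item forward latency in milliseconds at batch size one."""
    # Warmup passes (an assumption here) keep one-time setup costs out of the timing.
    for x in inputs[:warmup]:
        forward(x)

    start = time.perf_counter()
    for x in inputs:
        forward(x)  # one preprocessed molecule per call
    elapsed = time.perf_counter() - start
    return 1000.0 * elapsed / len(inputs)

if __name__ == "__main__":
    # Stand-in "model": any callable that takes one preprocessed input.
    def dummy_forward(x):
        return sum(i * i for i in range(1000)) + x

    print(mean_latency_ms(dummy_forward, list(range(100))))
```

When the model runs on a GPU, kernel launches are asynchronous, so a device synchronization before reading the timer is usually needed for wall-clock timings to reflect actual execution. Whether the reported measurements did this is not stated.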

### C.3. Scaling analysis

The scaling analyses for individual classification and regression tasks, as well as their averages, are provided in Tables [12](https://arxiv.org/html/2606.11382#A3.T12) and [13](https://arxiv.org/html/2606.11382#A3.T13), respectively.

Table 12. Effect of pretraining dataset size on classification tasks using GLACIER with MiniMol as a teacher. The best results for each dataset are marked in bold, and the second-best results are underlined. ↑: the higher the better. AUROC values represent means and their standard deviations from three independent runs.

Table 13. Effect of pretraining dataset size on regression tasks using GLACIER with MiniMol as a teacher. The best results for each dataset are marked in bold, and the second-best results are underlined. ↓: the lower the better. RMSE values represent means and their standard deviations from three independent runs.

### C.4. Modality ablation studies

To assess the contribution of each molecular representation and determine whether their integration provides complementary information, we conducted a modality ablation study comparing the full trimodal model against both pairwise bimodal and unimodal variants. The trimodal GLACIER model (MiniMol teacher) consistently achieves the best average performance on both classification and regression benchmarks, as shown in Tables [14](https://arxiv.org/html/2606.11382#A3.T14) and [15](https://arxiv.org/html/2606.11382#A3.T15), respectively. These results indicate that graph, text, and tabular representations capture complementary aspects of molecular structure and properties, and that their joint integration yields more robust and informative molecular representations than any subset of modalities alone.

Table 14. Modality ablation study on classification tasks using GLACIER with MiniMol as a teacher. The best results for each dataset are marked in bold, and the second-best results are underlined. ↑: the higher the better. AUROC values represent means and their standard deviations from three independent runs.

Table 15. Modality ablation study on regression tasks using GLACIER with MiniMol as a teacher. The best results for each dataset are marked in bold, and the second-best results are underlined. ↓: the lower the better. RMSE values represent means and their standard deviations from three independent runs.

### C.5. Model finetuning

Throughout this work, model performance is evaluated using downstream fingerprinting, where embeddings from a single forward pass of a pretrained model are evaluated by a task-specific head. In addition to using this lightweight evaluation protocol to estimate the quality of the learned representations, we compared it against finetuning a full model end-to-end, in which all model parameters are updated. Table [16](https://arxiv.org/html/2606.11382#A3.T16) shows that finetuning improves performance for both MolFormer as a teacher and GLACIER as its student model on the ESOL dataset. However, GLACIER already achieves strong performance in the downstream fingerprinting setting and remains superior after finetuning. While full finetuning provides substantial gains (1.866 to 1.108 average RMSE) for the MolFormer teacher, the relatively small improvement (0.939 to 0.882 average RMSE) observed for its GLACIER student suggests that its pretrained representations are already highly predictive. Given the increased computational cost of updating the full GLACIER model, these results support the use of a frozen backbone as an efficient and effective downstream strategy.

Table 16. Performance comparison between downstream fingerprinting and finetuned models on the ESOL regression task. The best results are marked in bold. ↓: the lower the better. RMSE values are represented as means and their corresponding standard deviations from three independent runs.

## Appendix D Architecture

The architecture of the Finsler-based fusion approach is presented in Algorithm [1](https://arxiv.org/html/2606.11382#alg1) and the architecture of the overall distillation pipeline is presented in Algorithm [2](https://arxiv.org/html/2606.11382#alg2).

Algorithm 1. Multimodal Finsler Fusion

1. Input: $\mathbf{z}_{text},\mathbf{z}_{graph},\mathbf{z}_{tab}\in\mathbb{R}^{d}$, MLPs $\{q,k,v,drift,amp\}$, base $\lambda_{raw}$
2. Output: Fused representation $\mathbf{h}_{fused}$
3. $Q,\boldsymbol{\omega}_{raw}\leftarrow\text{MLP}_{q}(\mathbf{z}_{text}),\ \text{MLP}_{drift}(\mathbf{z}_{text})$ ▷ Get query
4. $\boldsymbol{\omega}\leftarrow\frac{\boldsymbol{\omega}_{raw}}{\|\boldsymbol{\omega}_{raw}\|_{2}+\epsilon}\cdot\tanh(\|\boldsymbol{\omega}_{raw}\|_{2})$ ▷ Calculate drift
5. $S\leftarrow\{\mathbf{z}_{graph},\mathbf{z}_{tab}\}$
6. for $\mathbf{k}_{i}\in S$ do ▷ Process remaining modalities
7. $K_{i},V_{i}\leftarrow\text{MLP}_{k}(\mathbf{k}_{i}),\ \text{MLP}_{v}(\mathbf{k}_{i})$
8. $d_{i}\leftarrow\|K_{i}-Q\|_{2}+\langle K_{i}-Q,\boldsymbol{\omega}\rangle$ ▷ Calculate asymmetric distance
9. end for
10. $w\leftarrow\text{Softmax}(-\{d_{graph},d_{tab}\}/\sqrt{d})$ ▷ Calculate attention weights
11. $\mathbf{c}\leftarrow\sum_{i}w_{i}V_{i}$
12. $\alpha,\lambda\leftarrow\text{Softplus}(\text{MLP}_{amp}(\mathbf{z}_{text})),\ \text{Softplus}(\lambda_{raw})$ ▷ Calculate gating factor
13. $\gamma\leftarrow\alpha\cdot\sigma(-\min(d_{i})\cdot\lambda/\sqrt{d})$
14. $\mathbf{\hat{z}}_{text}\leftarrow\mathbf{z}_{text}+\gamma\mathbf{c}$ ▷ Update text representation
15. return $\text{LayerNorm}(\text{Linear}(\mathbf{z}_{graph}\parallel\mathbf{\hat{z}}_{text}\parallel\mathbf{z}_{tab}))$
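
The geometric core of Algorithm 1 (the bounded drift of step 4, the asymmetric distance of step 8, and the attention weights of step 10) can be illustrated without any learned components. In the sketch below the MLP projections are omitted and the vectors are supplied directly; the numerical value of `eps` and all function names are assumptions. Because the drift norm stays below one, the distance remains positive, which is the condition under which a Euclidean norm plus a linear drift term defines a Randers-type Finsler metric.

```python
import math

def norm(v):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))

def bounded_drift(omega_raw, eps=1e-8):
    """Step 4: rescale the raw drift so that its norm is tanh(||omega_raw||) < 1."""
    n = norm(omega_raw)
    return [x / (n + eps) * math.tanh(n) for x in omega_raw]

def finsler_distance(k, q, omega):
    """Step 8: Euclidean term plus a direction-dependent drift term."""
    diff = [a - b for a, b in zip(k, q)]
    return norm(diff) + sum(d * w for d, w in zip(diff, omega))

def attention_weights(distances, dim):
    """Step 10: softmax over negated distances scaled by sqrt(dim)."""
    logits = [-d / math.sqrt(dim) for d in distances]
    peak = max(logits)
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

if __name__ == "__main__":
    # Hypothetical 2-D query, key, and raw drift.
    q, k = [0.0, 0.0], [1.0, 0.0]
    omega = bounded_drift([2.0, 0.0])
    print(finsler_distance(k, q, omega), finsler_distance(q, k, omega))
```

Swapping the two arguments changes the value of the distance, which is the asymmetry that the drift vector introduces.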

Algorithm 2. Multimodal Pretraining with Student-Teacher Distillation

1. Input: Data $\mathcal{X}$, teachers $T_{raw}$, temp $\tau$, min-trust $\epsilon$
2. Output: Dynamic distillation loss $\mathcal{L}_{total}$
3. $\mathbf{z}_{graph},\mathbf{z}_{text},\mathbf{z}_{tab}\leftarrow\text{Encoders}(\mathcal{X}_{graph},\mathcal{X}_{text},\mathcal{X}_{tab})$ ▷ Step 1: Feature Extraction
4. ▷ Step 2: Finsler Fusion
5. $\mathbf{h}_{fused}\leftarrow$ Algorithm [1](https://arxiv.org/html/2606.11382#alg1)$(\mathbf{z}_{graph},\mathbf{z}_{text},\mathbf{z}_{tab})$
6. $\mathbf{h}_{proj}\leftarrow\text{Projector}_{student}(\mathbf{h}_{fused})$
7. $\mathbf{w}_{trust}\leftarrow\sigma(\text{MLP}_{trust}(\mathbf{h}_{proj}))\cdot(1-\epsilon)+\epsilon$
8. ▷ Step 3: Student-Teacher InfoNCE Distillation
9. $\mathcal{L}_{total}\leftarrow 0,\quad\mathbf{h}_{norm}\leftarrow\text{Normalize}(\mathbf{h}_{proj})$ ▷ Initialize loss & normalize student
10. for $T_{i}\in T_{raw}$ do
11. $T_{norm}\leftarrow\text{Normalize}(\text{Projector}_{i}(T_{i}))$ ▷ Project and normalize teacher
12. $\mathcal{L}_{NCE}\leftarrow\text{CrossEntropy}(\mathbf{h}_{norm}T_{norm}^{\top}/\tau)$ ▷ Contrastive alignment
13. $\mathcal{L}_{total}\leftarrow\mathcal{L}_{total}+\mathbf{w}_{\text{trust},i}\cdot\mathcal{L}_{NCE}-\log(\mathbf{w}_{\text{trust},i})$
14. end for
15. return $\mathcal{L}_{total}$

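
The loss of Algorithm 2 can be sketched in plain Python on small lists of embeddings. Several points are assumptions rather than details confirmed by the text: the per-teacher terms are summed into the total, the trust weight is a sigmoid rescaled into $(\epsilon, 1)$ so that $\epsilon$ acts as a floor, each teacher gets a single scalar trust logit instead of an output of $\text{MLP}_{trust}$, and the student and teacher projectors are omitted. All function names are hypothetical.

```python
import math

def normalize(v):
    """Scale a vector to unit Euclidean norm (zero vectors are left as is)."""
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]

def info_nce(student, teacher, tau=0.07):
    """Mean cross-entropy over similarity rows, with matching indices as targets."""
    s = [normalize(v) for v in student]
    t = [normalize(v) for v in teacher]
    loss = 0.0
    for i, si in enumerate(s):
        logits = [sum(a * b for a, b in zip(si, tj)) / tau for tj in t]
        peak = max(logits)
        log_z = peak + math.log(sum(math.exp(l - peak) for l in logits))
        loss += log_z - logits[i]
    return loss / len(s)

def trust_weight(raw_logit, eps=0.1):
    """Sigmoid rescaled into (eps, 1), so no teacher is ignored entirely."""
    sigmoid = 0.5 * (1.0 + math.tanh(0.5 * raw_logit))  # numerically stable form
    return sigmoid * (1.0 - eps) + eps

def total_loss(student, teachers, trust_logits, tau=0.07, eps=0.1):
    """Sum of trust-weighted InfoNCE terms with a -log(w) penalty per teacher."""
    total = 0.0
    for teacher, logit in zip(teachers, trust_logits):
        w = trust_weight(logit, eps)
        total += w * info_nce(student, teacher, tau) - math.log(w)
    return total

if __name__ == "__main__":
    # Hypothetical batch of two 2-D embeddings and two teachers.
    student = [[1.0, 0.0], [0.0, 1.0]]
    teachers = [[[0.9, 0.1], [0.2, 0.8]], [[0.0, 1.0], [1.0, 0.0]]]
    print(total_loss(student, teachers, trust_logits=[0.0, 0.0]))
```

The $-\log(w)$ term penalizes small trust weights, so the model cannot lower the loss simply by driving every weight toward the floor.
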