Scalable Peptide Design via Memory-Efficient Equivariant Transformer

arXiv cs.LG 06/25/26, 04:00 AM Papers
Summary
Introduces MEET, a memory-efficient E(3) equivariant transformer for full-atom peptide design, integrated with a VAE and latent diffusion pipeline to achieve linear memory scaling and improved generation quality.
arXiv:2606.25006v1 Announce Type: new Abstract: Target-specific peptide design requires sequence and structure co-design under full atom geometric constraints. Latent generative frameworks offer an effective route for this problem by compressing fine grained atomic structures into block level latent representations and performing conditional generation in a compact latent space. However, the scalability of such systems depends heavily on the geometric backbone used throughout their encoding, decoding, and denoising components. We introduce MEET (Memory Efficient Equivariant Transformer), an E(3) equivariant backbone for scalable atomistic peptide modeling. MEET maintains coupled invariant scalar and equivariant vector feature streams, while reformulating geometric computation around memory efficient attention. It initializes vector features through global coordinate aggregation, incorporates pairwise distances through augmented query and key dot products, and injects covalent bond information through sparse bond adaptation. Integrated into a VAE and latent diffusion pipeline for full atom peptide generation, MEET achieves linear memory scaling with atom count and improves generation quality over existing peptide design methods. Experiments on large scale AFDB derived datasets further show that the proposed backbone supports systematic model and data scaling, leading to better binding affinity, physical validity, and sample diversity.
Cached at: 06/25/26, 05:11 AM
# Scalable Peptide Design via Memory-Efficient Equivariant Transformer
Source: [https://arxiv.org/html/2606.25006](https://arxiv.org/html/2606.25006)
[1] Department of Computer Science and Technology, Tsinghua University; [2] Institute for AI Industry Research, Tsinghua University; [3] Department of Electronic Engineering, Tsinghua University; [4] Department of Chemistry, Tsinghua University. \*Equal contribution.

Xiangzhe Kong\*, Yinjun Jia\*, Yijia Zhang, Ziyi Yang, Yang Liu, Jianzhu Ma† (majianzhu@tsinghua.edu.cn)

###### Abstract

Target-specific peptide design requires sequence and structure co-design under full atom geometric constraints. Latent generative frameworks offer an effective route for this problem by compressing fine grained atomic structures into block level latent representations and performing conditional generation in a compact latent space. However, the scalability of such systems depends heavily on the geometric backbone used throughout their encoding, decoding, and denoising components. We introduce MEET (Memory Efficient Equivariant Transformer), an E(3) equivariant backbone for scalable atomistic peptide modeling. MEET maintains coupled invariant scalar and equivariant vector feature streams, while reformulating geometric computation around memory efficient attention. It initializes vector features through global coordinate aggregation, incorporates pairwise distances through augmented query and key dot products, and injects covalent bond information through sparse bond adaptation. Integrated into a VAE and latent diffusion pipeline for full atom peptide generation, MEET achieves linear memory scaling with atom count and improves generation quality over existing peptide design methods. Experiments on large scale AFDB derived datasets further show that the proposed backbone supports systematic model and data scaling, leading to better binding affinity, physical validity, and sample diversity.

Code: https://github.com/jiaor17/MEET

## 1 Introduction

Designing peptides that bind a specified protein pocket is a central problem in structure-based drug discovery [wang2022therapeutic]. The task is a form of sequence and structure co-design, where peptide sequence, conformation, and pocket binding geometry are coupled through physical interactions. Accurate modeling therefore requires full-atom geometric reasoning over side chain packing, hydrogen bonding, shape complementarity, and local steric compatibility. This makes E(3) equivariant architectures [thomas2018tensor,satorras2021en,deng2021vectorneuronsgeneralframework,schutt2021equivariant] a natural choice, since they respect the symmetries of three-dimensional space while operating directly on molecular geometry.

Latent generative frameworks [rombach2022high,kong2025unimomo] offer a practical way to balance full-atom fidelity with generative efficiency. In this formulation, a VAE compresses fine-grained atomic structures into block-level latent points, so that local atomic details are preserved through reconstruction rather than represented explicitly throughout the diffusion process. Generation is then performed over a compact latent graph conditioned on the target pocket, with the decoder mapping the generated latent states back to full-atom geometry. As the geometric backbone is instantiated throughout the VAE and LDM components, its expressiveness and memory efficiency become central to the scalability of the overall generative system.

This places a strong demand on the backbone used for full-atom peptide and pocket complexes. Such complexes often contain hundreds to thousands of atoms. Existing geometric backbones commonly introduce coordinate dependence either through a dense distance matrix [jiao2025equivariantpretrainedtransformerunified], or through explicit local molecular graphs [schutt2018schnet,satorras2021en,schutt2021equivariant,tholke2022torchmdnet,liao2023equiformer]. These mechanisms are effective, but they require storing or recomputing pairwise biases, neighbor lists, or edge features, which increases memory traffic and restricts the feasible model size and batch size. The issue becomes more pronounced when scaling generative models, where improvements in sample quality often require both larger backbones and larger training sets.

To address this backbone-level bottleneck, we introduce MEET (Memory Efficient Equivariant Transformer), an E(3) equivariant Transformer backbone for scalable atomistic peptide modeling. MEET keeps coupled scalar and vector feature streams, where the scalar stream is rotation invariant and the vector stream is rotation equivariant. Its geometric operations are written in forms that are compatible with memory-efficient attention kernels such as FlashAttention [dao2022flashattention]. In particular, MEET initializes vector features through global coordinate aggregation, folds pairwise distance information into augmented query and key dot products, and injects covalent bond information through a sparse bond adapter. Together these design choices avoid materializing quadratic activation tensors when memory-efficient attention is used, while preserving E(3) equivariance of the full backbone.

We evaluate this backbone inside a two-stage latent generative framework inspired by UniMoMo [kong2025unimomo]. Within this framework, MEET serves as the geometric backbone for the encoder, block type decoder, structure decoder, and LDM denoiser. This setting lets us test whether a more memory-efficient equivariant backbone improves the complete generation system. Figure [1](https://arxiv.org/html/2606.25006#S1.F1) summarizes the architecture and the main geometric modules. Our scaling experiments focus on the LDM denoiser, whose capacity directly affects the quality of generated peptides in our framework.

Together, the architecture and generative evaluation lead to three concrete contributions\.

Efficient architecture. MEET replaces dense distance biases with query and key augmentation, replaces local-graph vector initialization with global attention aggregation, and uses sparse bond adaptation for chemical adjacency. These choices give linear peak activation memory in the number of atoms for a fixed model size when paired with memory-efficient attention kernels.

Large scale datasets. Starting from 8.64 million AFDB [varadi2022alphafold] domains, we construct approximately 100 million candidate segments and sample 100K and 1.2M training structures using sliding window enumeration, structure quality filtering, interface screening, and sequence overlap clustering.

Systematic scaling. On the 100K benchmark, MEET improves over PepGLAD [kong2024pepglad], PepFlow [li2024pepflow], UniMoMo [kong2025unimomo], and DiffPepBuilder [wang2024diffpepbuilder] on binding free energy and physical validity. Scaling the latent denoiser across four DiT-style [peebles2023scalable] model sizes on the 1.2M dataset further improves generation quality and restores sample diversity.

![Refer to caption](https://arxiv.org/html/2606.25006v1/x1.png)

Figure 1: Overview of the MEET architecture. MEET processes atom coordinates $\bm{X}$ together with scalar and vector feature streams $\bm{H}$ and $\bm{V}$. The initialization layer uses distance-aware attention to produce $\bm{H}^{(0)}$ and $\bm{V}^{(0)}$. Each repeated block contains a bond adapter, an equivariant self-attention layer, and an equivariant feed-forward layer, producing final features $\bm{H}^{(L)}$ and $\bm{V}^{(L)}$. In self-attention, scalar and vector inputs are projected into $\bm{Q}$, $\bm{K}$, and $\bm{U}$, with distance encodings $-\phi_s$ and $\psi_s$ appended to $\bm{Q}$ and $\bm{K}$. The bond adapter injects edge attributes $e_{ij}$ through sparse messages $\Delta\bm{h}_{i\leftarrow j}$ and $\Delta\bm{v}_{i\leftarrow j}$.

## 2 Methods

Our peptide generator follows a two-stage latent generative framework inspired by UniMoMo [kong2025unimomo]. A VAE first encodes each full-atom peptide–pocket complex into block-level latent variables, a conditional latent diffusion model generates peptide latents from the target-pocket context, and the VAE decoder maps the generated latents back to peptide sequence and full-atom geometry. MEET serves as the geometric backbone in the VAE encoder, sequence decoder, structure decoder, and latent denoiser. The complete training objectives and sampling procedure are provided in Appendix [5](https://arxiv.org/html/2606.25006#S5).

In this section, we first present an overview of the MEET architecture in §[2.1](https://arxiv.org/html/2606.25006#S2.SS1). We then describe the distance-aware attention mechanism (§[2.2](https://arxiv.org/html/2606.25006#S2.SS2)) that is shared across the backbone, followed by the four main modules shown in Figure [1](https://arxiv.org/html/2606.25006#S1.F1): feature initialization (§[2.3](https://arxiv.org/html/2606.25006#S2.SS3)), equivariant self-attention (§[2.4](https://arxiv.org/html/2606.25006#S2.SS4)), the equivariant feed-forward layer (§[2.5](https://arxiv.org/html/2606.25006#S2.SS5)), and the bond adapter (§[2.6](https://arxiv.org/html/2606.25006#S2.SS6)). Finally, we analyze the space complexity in §[2.7](https://arxiv.org/html/2606.25006#S2.SS7).

### 2.1 Architecture Overview

MEET is a memory-efficient equivariant Transformer that maintains two coupled feature streams, scalar and vector. We use the same notation as Figure [1](https://arxiv.org/html/2606.25006#S1.F1), where $\bm{H}$ denotes the scalar stream, $\bm{V}$ denotes the vector stream, and $\bm{X}$ denotes atom coordinates. For a molecule with $N$ atoms, let $\bm{X}\in\mathbb{R}^{N\times 3}$ have rows $\bm{x}_i\in\mathbb{R}^{3}$. We represent per-atom features as a pair $(\bm{H},\bm{V})$, where $\bm{H}\in\mathbb{R}^{N\times d}$ are scalar (rotation-invariant) features and $\bm{V}\in\mathbb{R}^{N\times 3\times d}$ are vector (rotation-equivariant) features, following the common scalar-vector decomposition used in equivariant neural networks [deng2021vectorneuronsgeneralframework]. Under a rigid transformation $(\bm{R},\bm{t})$ with $\bm{R}\in\mathrm{SO}(3)$ and $\bm{t}\in\mathbb{R}^{3}$, coordinates transform as $\bm{x}_i\mapsto\bm{R}\,\bm{x}_i+\bm{t}$ and features transform as $\bm{H}\mapsto\bm{H}$ and $\bm{V}\mapsto\bm{R}\,\bm{V}$, with $\bm{R}$ acting on the spatial axis of $\bm{V}$. All modules of MEET preserve these transformation rules, so the backbone is E(3)-equivariant.

Given input atom attributes $\bm{A}$ and coordinates $\bm{X}$, the backbone first constructs an initial equivariant state via an initialization operator $\mathcal{I}$,

$$(\bm{H}^{(0)},\bm{V}^{(0)}) \;=\; \mathcal{I}\bigl(\bm{A},\bm{X}\bigr), \tag{1}$$

and then applies $L$ Transformer blocks to produce $(\bm{H}^{(L)},\bm{V}^{(L)})$. Each block consists of a self-attention layer $\mathcal{A}$ and a vector–scalar mixing feed-forward layer $\mathcal{F}$, optionally preceded by a bond adapter $\mathcal{B}$ when a bond edge set $\mathcal{E}$ is provided:

$$(\bar{\bm{H}}^{(\ell)},\bar{\bm{V}}^{(\ell)}) \;=\; (\bm{H}^{(\ell)},\bm{V}^{(\ell)}) \;+\; \mathcal{B}^{(\ell)}\bigl(\mathrm{Norm}(\bm{H}^{(\ell)},\bm{V}^{(\ell)}),\,\mathcal{E}\bigr), \tag{2}$$

$$(\tilde{\bm{H}}^{(\ell)},\tilde{\bm{V}}^{(\ell)}) \;=\; (\bar{\bm{H}}^{(\ell)},\bar{\bm{V}}^{(\ell)}) \;+\; \mathcal{A}^{(\ell)}\bigl(\mathrm{Norm}(\bar{\bm{H}}^{(\ell)},\bar{\bm{V}}^{(\ell)}),\,\bm{X}\bigr), \tag{3}$$

$$(\bm{H}^{(\ell+1)},\bm{V}^{(\ell+1)}) \;=\; (\tilde{\bm{H}}^{(\ell)},\tilde{\bm{V}}^{(\ell)}) \;+\; \mathcal{F}^{(\ell)}\bigl(\mathrm{Norm}(\tilde{\bm{H}}^{(\ell)},\tilde{\bm{V}}^{(\ell)})\bigr). \tag{4}$$

Each layer is wrapped with pre-normalization and a residual connection, following standard Transformer practice. In module-level descriptions below, $(\bm{H}_{\mathrm{in}},\bm{V}_{\mathrm{in}})$ and $(\bm{H}_{\mathrm{out}},\bm{V}_{\mathrm{out}})$ refer to the streams entering and leaving the corresponding panel in Figure [1](https://arxiv.org/html/2606.25006#S1.F1). The bond adapter can be applied in every block, only in the first block, or disabled entirely depending on bond availability.

#### Normalization\.

For the scalar stream we use RMSNorm [zhang2019root]. For the vector stream we use an analogue whose scale is rotation-invariant: it normalizes by the combined root-mean-square over the spatial and channel axes of each atom's vector features,

$$\mathrm{RMSNorm}_V(\bm{V}_i) \;=\; \frac{\bm{V}_i}{\sqrt{\dfrac{1}{3d}\sum_{k,c}\bm{V}_{i,k,c}^{2}+\epsilon}}. \tag{5}$$

The scale factor in the denominator is rotation-invariant, so rotating the input $\bm{V}_i$ by $\bm{R}$ rotates the output by the same $\bm{R}$, and the normalization remains equivariant.
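The equivariance claim for Equation (5) can be checked numerically. The following is a minimal NumPy sketch, not the authors' implementation: the function names `rms_norm_v` and `random_rotation`, the tensor sizes, and the value of `eps` are all assumptions made for illustration, and no learnable gain is included because Equation (5) does not show one.

```python
import numpy as np

def rms_norm_v(V, eps=1e-6):
    """Vector RMSNorm sketch. V has shape (N, 3, d).

    The mean over axes (1, 2) equals (1 / 3d) * sum_{k,c} V[i, k, c]^2,
    i.e. one invariant scale per atom.
    """
    mean_sq = np.mean(V ** 2, axis=(1, 2), keepdims=True)  # (N, 1, 1)
    return V / np.sqrt(mean_sq + eps)

def random_rotation(rng):
    """Draw a proper rotation (det = +1) from a QR decomposition."""
    Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    if np.linalg.det(Q) < 0:
        Q[:, 0] *= -1.0  # flip one column to turn a reflection into a rotation
    return Q

rng = np.random.default_rng(0)
V = rng.normal(size=(5, 3, 8))   # hypothetical sizes: N=5 atoms, d=8 channels
R = random_rotation(rng)

def rotate(T):
    # Apply R to the spatial axis (axis 1) of an (N, 3, d) tensor.
    return np.einsum("ab,nbc->nac", R, T)

norm_then_rotate = rotate(rms_norm_v(V))
rotate_then_norm = rms_norm_v(rotate(V))
print(np.allclose(norm_then_rotate, rotate_then_norm))
```

The two orders of operation agree because the sum of squares over the spatial axis is unchanged by an orthogonal matrix, so the per-atom scale is identical before and after rotation.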

### 2.2 Distance-aware Attention Mechanism

A component shared by the initialization layer $\mathcal{I}$ and the self-attention layers $\mathcal{A}$ is a distance-aware query–key augmentation. For head $h$, the attention logit includes an invariant distance penalty,

$$A_{ij}^{(h)} \;=\; \rho_h\Bigl(\langle\bm{q}_i^{(h)},\bm{k}_j^{(h)}\rangle \;-\; s_h^{2}\,\lVert\bm{x}_i-\bm{x}_j\rVert_2^{2}\Bigr), \tag{6}$$

where $\bm{q}_i^{(h)}$ and $\bm{k}_j^{(h)}$ are the module-specific query and key features, $\rho_h$ is the attention scale, and $s_h$ is a learnable distance scale.

Following FlashBias [wu2026flashbias], rather than storing the distance term as an $N\times N$ bias tensor, we absorb it into the query–key dot product. For coordinates $\bm{x}_i$ with components $(x_{1,i},x_{2,i},x_{3,i})$, define

$$\phi_{s_h}(\bm{x}_i) \;=\; s_h\bigoplus_{k=1}^{3}\bigl[x_{k,i}^{2},\;1,\;-2x_{k,i}\bigr], \qquad \psi_{s_h}(\bm{x}_j) \;=\; s_h\bigoplus_{k=1}^{3}\bigl[1,\;x_{k,j}^{2},\;x_{k,j}\bigr]. \tag{7}$$

Since $\langle\phi_{s_h}(\bm{x}_i),\psi_{s_h}(\bm{x}_j)\rangle = s_h^{2}\sum_{k}\bigl(x_{k,i}^{2}+x_{k,j}^{2}-2x_{k,i}x_{k,j}\bigr) = s_h^{2}\lVert\bm{x}_i-\bm{x}_j\rVert_2^{2}$, concatenating $-\phi_{s_h}$ to the query and $\psi_{s_h}$ to the key reproduces Equation [6](https://arxiv.org/html/2606.25006#S2.E6),

$$\bm{Q}_i^{(h)} \;=\; \mathrm{concat}\bigl(\bm{q}_i^{(h)},\;-\phi_{s_h}(\bm{x}_i)\bigr), \qquad \bm{K}_j^{(h)} \;=\; \mathrm{concat}\bigl(\bm{k}_j^{(h)},\;\psi_{s_h}(\bm{x}_j)\bigr), \qquad A_{ij}^{(h)} \;=\; \rho_h\langle\bm{Q}_i^{(h)},\,\bm{K}_j^{(h)}\rangle. \tag{8}$$

The augmentation adds only $9$ entries per head and keeps the attention computation compatible with fused kernels without an external distance-bias tensor. The resulting softmax weights

$$\alpha_{ij}^{(h)} \;=\; \mathrm{softmax}_j\bigl(A_{ij}^{(h)}\bigr) \tag{9}$$

depend on coordinates only through pairwise distances and are therefore E(3)-invariant.
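To make the query–key augmentation concrete, the sketch below builds the 9-entry encodings for a single head and compares the augmented logits against a reference that materializes the $N\times N$ squared-distance matrix. It is an illustrative NumPy check under assumed sizes and scale values (`N`, `dh`, `s`, `rho` are hypothetical), not the paper's kernel code. A fused attention kernel would consume `Q` and `K` directly and never form `D2`.

```python
import numpy as np

def phi(X, s):
    """Query-side encoding: per axis k, [x_k^2, 1, -2 x_k], scaled by s. X: (N, 3) -> (N, 9)."""
    feats = np.stack([X ** 2, np.ones_like(X), -2.0 * X], axis=-1)  # (N, 3, 3)
    return s * feats.reshape(len(X), 9)

def psi(X, s):
    """Key-side encoding: per axis k, [1, x_k^2, x_k], scaled by s. X: (N, 3) -> (N, 9)."""
    feats = np.stack([np.ones_like(X), X ** 2, X], axis=-1)         # (N, 3, 3)
    return s * feats.reshape(len(X), 9)

rng = np.random.default_rng(0)
N, dh = 6, 4            # hypothetical atom count and head dimension
s, rho = 0.7, 0.5       # hypothetical distance scale and attention scale
X = rng.normal(size=(N, 3))
q = rng.normal(size=(N, dh))
k = rng.normal(size=(N, dh))

# Reference logits with an explicit N x N squared-distance bias.
D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
A_ref = rho * (q @ k.T - s ** 2 * D2)

# Augmented logits: the bias is absorbed into a plain dot product.
Q = np.concatenate([q, -phi(X, s)], axis=1)   # (N, dh + 9)
K = np.concatenate([k, psi(X, s)], axis=1)    # (N, dh + 9)
A_aug = rho * (Q @ K.T)

print(np.allclose(A_ref, A_aug))
```

The design trade-off is nine extra columns per head in exchange for not having to pass a bias tensor to the attention kernel.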

### 2.3 Feature Initialization

The initialization layer $\mathcal{I}$ prepares the initial scalar and vector streams from atom attributes and coordinates. The scalar stream is obtained by embedding the input atom attributes, $\bm{H}^{(0)}=\mathrm{Embed}(\bm{A})$. The main role of $\mathcal{I}$ is therefore to construct a non-trivial vector stream $\bm{V}^{(0)}$ from geometry. This step is necessary because every subsequent operation on the vector stream is either a bias-free linear projection along the channel axis or a linear combination of equivariant vectors. If $\bm{V}$ were initialized as zero, these operations would keep it at zero, so the vector branch would remain uninformative and could not carry directional information.

Earlier equivariant backbones typically obtain directional states by aggregating relative positions over explicit local molecular graphs [schutt2018schnet,satorras2021en,schutt2021equivariant,tholke2022torchmdnet,liao2023equiformer]. Such designs require constructing neighbor edges and storing edge features, which increases memory traffic for large full-atom complexes. We instead initialize vector features through one round of distance-aware multi-head attention over the atoms in a complex, keeping the initialization compatible with memory-efficient attention kernels without constructing an explicit local molecular graph.

Concretely, queries and keys are computed from $\bm{H}^{(0)}$ and augmented with the distance encodings in Section [2.2](https://arxiv.org/html/2606.25006#S2.SS2), yielding invariant per-head attention weights $\alpha_{ij}^{(h)}$. We collect these weights into a per-head attention matrix $\bm{\alpha}^{(h)}\in\mathbb{R}^{N\times N}$. Unlike a standard attention layer, the value in $\mathcal{I}$ is the coordinate matrix $\bm{X}\in\mathbb{R}^{N\times 3}$ itself, shared across heads. For each head $h$, the attended coordinate is centered at the query atom,

$$\bm{V}^{(h)} \;=\; \bm{\alpha}^{(h)}\bm{X}-\bm{X} \;=\; \bigl(\bm{\alpha}^{(h)}-\bm{I}\bigr)\bm{X} \;\in\; \mathbb{R}^{N\times 3}. \tag{10}$$

Equivalently, because each row of $\bm{\alpha}^{(h)}$ sums to one, the vector at atom $i$ is

$$\bm{v}_i^{(h)} \;=\; \sum_{j}\alpha_{ij}^{(h)}(\bm{x}_j-\bm{x}_i). \tag{11}$$

The centering by $\bm{x}_i$ removes dependence on the absolute coordinate frame and ensures that only relative displacements enter the vector stream.

The $H$ per-head displacement fields are then stacked along a new channel axis and projected to the model dimension with a bias-free linear map,

$$\bm{V}^{(0)} \;=\; \mathrm{stack}_h\bigl(\bm{V}^{(h)}\bigr)\,\bm{W}_{\text{out}}^{\mathcal{I}} \;\in\; \mathbb{R}^{N\times 3\times d}, \tag{12}$$

where $\mathrm{stack}_h$ forms a tensor in $\mathbb{R}^{N\times 3\times H}$ before the projection mixes the head axis into $d$ vector channels while leaving the spatial axis untouched. Because $\alpha_{ij}^{(h)}$ depends only on pairwise distances, it is invariant to rigid transformations. The centered displacement $\sum_{j}\alpha_{ij}^{(h)}(\bm{x}_j-\bm{x}_i)$ is invariant to translation and transforms as a 3D vector under rotation. The bias-free output projection only mixes channels, so $\bm{V}^{(0)}$ is equivariant.
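The initialization in Equations (10) to (12) can be sketched in a few lines and tested for equivariance. This NumPy example is illustrative only. It materializes the attention matrix densely for readability, whereas a memory-efficient kernel would compute $\bm{\alpha}^{(h)}\bm{X}$ with $\bm{X}$ as the value without forming $\bm{\alpha}^{(h)}$. The random `logits` stand in for the content term from $\bm{H}^{(0)}$, the attention scale $\rho_h$ is folded into them, and all names and sizes are assumptions.

```python
import numpy as np

def softmax(A):
    A = A - A.max(axis=-1, keepdims=True)
    E = np.exp(A)
    return E / E.sum(axis=-1, keepdims=True)

def distance_attention(X, logits, s):
    """alpha[h] = softmax_j(logits[h] - s[h]^2 * ||x_i - x_j||^2). Dense, for clarity only."""
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)       # (N, N)
    return softmax(logits - (s ** 2)[:, None, None] * D2[None])    # (H, N, N)

def init_vectors(alpha, X, W_out):
    """V0 = stack_h((alpha[h] - I) X) @ W_out, with shape (N, 3, d)."""
    V_heads = np.einsum("hij,jk->hik", alpha, X) - X[None]  # (H, N, 3), Eq. (10)
    V_stack = np.transpose(V_heads, (1, 2, 0))              # (N, 3, H)
    return V_stack @ W_out                                  # bias-free channel mixing

def random_rotation(rng):
    Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    if np.linalg.det(Q) < 0:
        Q[:, 0] *= -1.0
    return Q

rng = np.random.default_rng(0)
N, H, d = 7, 2, 5                        # hypothetical sizes
X = rng.normal(size=(N, 3))
logits = rng.normal(size=(H, N, N))      # stand-in for invariant content logits
s = np.array([0.5, 1.0])                 # hypothetical per-head distance scales
W_out = rng.normal(size=(H, d))

V0 = init_vectors(distance_attention(X, logits, s), X, W_out)

# Apply a rigid motion x -> R x + t and recompute.
R, t = random_rotation(rng), rng.normal(size=3)
X_moved = X @ R.T + t
V0_moved = init_vectors(distance_attention(X_moved, logits, s), X_moved, W_out)

print(np.allclose(V0_moved, np.einsum("ab,nbc->nac", R, V0)))
```

The translation `t` cancels because the rows of each attention matrix sum to one, which is exactly what the centering in Equation (11) relies on.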

### 2.4 Equivariant Self-Attention

Each Transformer block applies an equivariant self-attention layer $\mathcal{A}$ that updates both streams in a single multi-head attention pass, mapping $(\bm{H}_{\mathrm{in}},\bm{V}_{\mathrm{in}})$ to $(\bm{H}_{\mathrm{out}},\bm{V}_{\mathrm{out}})$ as in Figure [1](https://arxiv.org/html/2606.25006#S1.F1).

#### Q, K, and U preparation\.

The scalar and vector streams are projected independently. An unconstrained linear layer maps $\bm{H}_{\mathrm{in}}$ to $(\bm{H}_Q,\bm{H}_K,\bm{H}_U)$, while a bias-free linear layer acting on the channel axis of $\bm{V}_{\mathrm{in}}$ yields $(\bm{V}_Q,\bm{V}_K,\bm{V}_U)$. The vector queries and keys are normalized via $\mathrm{RMSNorm}_V$, and the spatial axis of each projected vector tensor is flattened into the channel axis. The per-head query and key features entering the distance-aware mechanism of Section [2.2](https://arxiv.org/html/2606.25006#S2.SS2) are then

$$\bm{q}_i^{(h)}=\mathrm{concat}\bigl(\bm{H}_{Q,i}^{(h)},\;\mathrm{flat}(\bm{V}_{Q,i}^{(h)})\bigr), \qquad \bm{k}_j^{(h)}=\mathrm{concat}\bigl(\bm{H}_{K,j}^{(h)},\;\mathrm{flat}(\bm{V}_{K,j}^{(h)})\bigr), \tag{13}$$

which are augmented with the distance encoding vectors $-\phi_{s_h}$ and $\psi_{s_h}$ (Equation [8](https://arxiv.org/html/2606.25006#S2.E8)) to form $\bm{Q}_i^{(h)}$ and $\bm{K}_j^{(h)}$ as before. The value $\bm{U}_j^{(h)}$ concatenates only the scalar and flattened-vector parts, without any distance encoding,

$$\bm{U}_j^{(h)}=\mathrm{concat}\bigl(\bm{H}_{U,j}^{(h)},\;\mathrm{flat}(\bm{V}_{U,j}^{(h)})\bigr). \tag{14}$$

The dot product $\langle\bm{Q}_i^{(h)},\bm{K}_j^{(h)}\rangle$ is E(3)-invariant for three reasons: the scalar parts are trivially invariant, the flattened vector parts contribute a sum of 3D inner products $\sum_{c}\langle\bm{V}_{Q,i,:,c},\bm{V}_{K,j,:,c}\rangle$ that are invariant by orthogonality of $\bm{R}$, and the distance encoding contributes $-s_h^{2}\lVert\bm{x}_i-\bm{x}_j\rVert_2^{2}$.

#### Aggregation and output projection\.

The invariant attention weights $\alpha_{ij}^{(h)}$ are applied to $\bm{U}^{(h)}$, and the result is split into scalar and vector parts,

$$\bm{O}_i^{(h)}=\sum_{j}\alpha_{ij}^{(h)}\,\bm{U}_j^{(h)}, \qquad \bm{O}_i^{(h)}\to\bigl(\bm{O}_{H,i}^{(h)},\;\bm{O}_{V,i}^{(h)}\bigr). \tag{15}$$

The per-head outputs are concatenated across heads and projected back to the model dimension by two independent output projections,

$$\bm{H}_{\mathrm{out},i}=\mathrm{concat}_h\bigl(\bm{O}_{H,i}^{(h)}\bigr)\,\bm{W}_{\text{out}}^{H}, \qquad \bm{V}_{\mathrm{out},i}=\mathrm{concat}_h\bigl(\bm{O}_{V,i}^{(h)}\bigr)\,\bm{W}_{\text{out}}^{V}, \tag{16}$$

where $\bm{W}_{\text{out}}^{H}$ is unconstrained and $\bm{W}_{\text{out}}^{V}$ is bias-free and acts on the channel axis only. The scalar branch manipulates invariant quantities throughout, and the vector branch applies only bias-free channel mixing to equivariant tensors, so E(3) equivariance is preserved.
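A compact way to see how Equations (13) to (16) fit together is a single-head NumPy sketch with an equivariance check. It is a simplification under stated assumptions rather than the authors' layer: it uses one head, omits biases on the scalar projections and the $\mathrm{RMSNorm}_V$ on vector queries and keys, takes $\rho_h = 1/\sqrt{\dim \bm{q}}$, and writes the distance penalty densely instead of through the augmented dot product. The weight dictionary `P` and its keys are hypothetical names.

```python
import numpy as np

def softmax(A):
    A = A - A.max(axis=-1, keepdims=True)
    E = np.exp(A)
    return E / E.sum(axis=-1, keepdims=True)

def equivariant_attention(H, V, X, P, s=0.5):
    """Single-head sketch. H: (N, d), V: (N, 3, d), X: (N, 3)."""
    N, d = H.shape
    HQ, HK, HU = H @ P["hq"], H @ P["hk"], H @ P["hu"]   # scalar projections
    VQ, VK, VU = V @ P["vq"], V @ P["vk"], V @ P["vu"]   # bias-free, channel axis only
    q = np.concatenate([HQ, VQ.reshape(N, -1)], axis=1)  # Eq. (13): (N, d + 3d)
    k = np.concatenate([HK, VK.reshape(N, -1)], axis=1)
    U = np.concatenate([HU, VU.reshape(N, -1)], axis=1)  # Eq. (14)
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    alpha = softmax((q @ k.T - s ** 2 * D2) / np.sqrt(q.shape[1]))
    O = alpha @ U                                        # Eq. (15)
    O_H, O_V = O[:, :d], O[:, d:].reshape(N, 3, d)       # split scalar / vector parts
    return O_H @ P["oh"], O_V @ P["ov"]                  # Eq. (16)

def random_rotation(rng):
    Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    if np.linalg.det(Q) < 0:
        Q[:, 0] *= -1.0
    return Q

rng = np.random.default_rng(0)
N, d = 6, 4   # hypothetical sizes
P = {name: rng.normal(size=(d, d)) / np.sqrt(d)
     for name in ["hq", "hk", "hu", "vq", "vk", "vu", "oh", "ov"]}
H = rng.normal(size=(N, d))
V = rng.normal(size=(N, 3, d))
X = rng.normal(size=(N, 3))

H_out, V_out = equivariant_attention(H, V, X, P)

R, t = random_rotation(rng), rng.normal(size=3)
V_rot = np.einsum("ab,nbc->nac", R, V)
H_out2, V_out2 = equivariant_attention(H, V_rot, X @ R.T + t, P)

print(np.allclose(H_out2, H_out))
print(np.allclose(V_out2, np.einsum("ab,nbc->nac", R, V_out)))
```

Flattening the spatial axis into channels is what lets one attention pass carry both streams: the dot product of two flattened vector tensors is a sum of 3D inner products, so the logits stay invariant while the weighted sum of values stays equivariant.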

### 2.5 Equivariant Feed-Forward Layer

Each attention layer is followed by a feed-forward layer $\mathcal{F}$ that couples the scalar and vector streams while preserving E(3) equivariance. The layer maps $(\bm{H}_{\mathrm{in}},\bm{V}_{\mathrm{in}})$ to $(\bm{H}_{\mathrm{out}},\bm{V}_{\mathrm{out}})$ as in Figure [1](https://arxiv.org/html/2606.25006#S1.F1). The vector features are first projected through a bias-free linear layer and split into a scalar-summary part and a hidden part,

$$(\bm{V}_{\mathrm{in}}^{(1)},\;\bm{V}_{\mathrm{in}}^{(2)}) \;=\; \mathrm{split}\bigl(\bm{V}_{\mathrm{in}}\,\bm{W}_{V}^{\text{in}}\bigr), \qquad \bm{V}_{\mathrm{in}}^{(1)}\in\mathbb{R}^{N\times 3\times d},\;\bm{V}_{\mathrm{in}}^{(2)}\in\mathbb{R}^{N\times 3\times d_{\text{ff}}}. \tag{17}$$

An invariant summary $\bm{Y}_i=[\lVert\bm{V}_{\mathrm{in},i,:,1}^{(1)}\rVert_2,\ldots,\lVert\bm{V}_{\mathrm{in},i,:,d}^{(1)}\rVert_2]\in\mathbb{R}^{d}$ is computed by taking channel-wise norms. The scalar branch uses the SwiGLU nonlinearity [shazeer2020glu], and the scalar and vector streams are then updated as

$$(\bm{h}^{(1)},\;\bm{h}^{(2)}) \;=\; \mathrm{split}\bigl([\bm{H}_{\mathrm{in}},\bm{Y}]\,\bm{W}_{H}^{\text{in}}+\bm{b}_{H}^{\text{in}}\bigr), \tag{18}$$

$$\bm{h}^{(2)} \;\leftarrow\; \mathrm{SwiGLU}(\bm{h}^{(2)}), \qquad \bm{V}_{\mathrm{in}}^{(2)} \;\leftarrow\; \mathrm{SiLU}(\bm{h}^{(1)})\odot\bm{V}_{\mathrm{in}}^{(2)}, \tag{19}$$

$$\bm{H}_{\mathrm{out}} \;=\; \bm{h}^{(2)}\,\bm{W}_{H}^{\text{out}}+\bm{b}_{H}^{\text{out}}, \qquad \bm{V}_{\mathrm{out}} \;=\; \bm{V}_{\mathrm{in}}^{(2)}\,\bm{W}_{V}^{\text{out}}. \tag{20}$$
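The feed-forward update in Equations (17) to (20) leaves some shapes implicit, so the sketch below fixes one consistent reading and checks that it is equivariant. The assumptions are: $\bm{h}^{(1)}\in\mathbb{R}^{d_{\text{ff}}}$ so that it can gate $\bm{V}_{\mathrm{in}}^{(2)}$ channel by channel, the gate is broadcast over the spatial axis, and SwiGLU splits its input of width $2d_{\text{ff}}$ into a gate half and a value half. The weight names in `P` are hypothetical, and the real layer may size these tensors differently.

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

def swiglu(x):
    """Assumed SwiGLU form: split the last axis in half, SiLU(a) * b."""
    a, b = np.split(x, 2, axis=-1)
    return silu(a) * b

def equivariant_ffn(H, V, P):
    """H: (N, d), V: (N, 3, d). Returns updated (H_out, V_out)."""
    d = H.shape[1]
    VW = V @ P["v_in"]                          # bias-free: (N, 3, d + d_ff)
    V1, V2 = VW[..., :d], VW[..., d:]           # Eq. (17)
    d_ff = V2.shape[-1]
    Y = np.linalg.norm(V1, axis=1)              # channel-wise norms, (N, d), invariant
    h = np.concatenate([H, Y], axis=1) @ P["h_in"] + P["b_in"]   # Eq. (18)
    h1, h2 = h[:, :d_ff], h[:, d_ff:]           # widths d_ff and 2 * d_ff
    h2 = swiglu(h2)                             # Eq. (19), scalar branch: (N, d_ff)
    V2 = silu(h1)[:, None, :] * V2              # Eq. (19), invariant gate on vectors
    return h2 @ P["h_out"] + P["b_out"], V2 @ P["v_out"]         # Eq. (20)

def random_rotation(rng):
    Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    if np.linalg.det(Q) < 0:
        Q[:, 0] *= -1.0
    return Q

rng = np.random.default_rng(0)
N, d, d_ff = 5, 4, 6   # hypothetical sizes
P = {
    "v_in": rng.normal(size=(d, d + d_ff)) / np.sqrt(d),
    "h_in": rng.normal(size=(2 * d, 3 * d_ff)) / np.sqrt(2 * d),
    "b_in": rng.normal(size=3 * d_ff),
    "h_out": rng.normal(size=(d_ff, d)) / np.sqrt(d_ff),
    "b_out": rng.normal(size=d),
    "v_out": rng.normal(size=(d_ff, d)) / np.sqrt(d_ff),
}
H = rng.normal(size=(N, d))
V = rng.normal(size=(N, 3, d))

H_out, V_out = equivariant_ffn(H, V, P)

R = random_rotation(rng)
H_out2, V_out2 = equivariant_ffn(H, np.einsum("ab,nbc->nac", R, V), P)

print(np.allclose(H_out2, H_out))
print(np.allclose(V_out2, np.einsum("ab,nbc->nac", R, V_out)))
```

Biases appear only on the scalar branch. Adding a bias to the vector branch would introduce a fixed direction in space and break equivariance.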

### 2.6 Bond Adapter

Geometric proximity alone does not fully determine local interactions. Chemical adjacency, covalent bonds, and other structured edge signals encode strong constraints that a dense attention mechanism captures only indirectly. Unlike the dense pairwise distance bias, bond adjacency is sparse and low-degree, so treating it as a dense attention bias would be wasteful. We instead inject bond information through a sparse message-passing adapter $\mathcal{B}$ whose cost scales linearly with the number of edges.

Let $\mathcal{E}$ denote the bond edge set and let $e_{ij}\in\mathbb{R}^{d_e}$ be an edge attribute for $(i,j)\in\mathcal{E}$. For each edge, we use the notation in Figure [1](https://arxiv.org/html/2606.25006#S1.F1), where $\bm{h}_i$ and $\bm{h}_j$ are the scalar features of the target and source atoms, and $\bm{v}_j$ is the source atom's vector feature. We concatenate the scalar endpoint features with the edge attribute and pass the result through a small MLP $f_\theta$,

$$\bm{m}_{ij} \;=\; f_\theta\bigl([\bm{h}_i,\,\bm{h}_j,\,e_{ij}]\bigr) \;\in\; \mathbb{R}^{2d}. \tag{21}$$

The output $\bm{m}_{ij}$ is split into a scalar message $\Delta\bm{h}_{i\leftarrow j}\in\mathbb{R}^{d}$ and a gating coefficient $\bm{g}_{i\leftarrow j}\in\mathbb{R}^{d}$. The scalar messages are aggregated by mean-pooling over incoming edges, while the vector message $\Delta\bm{v}_{i\leftarrow j}$ is constructed by gating the source vector feature $\bm{v}_j$ with the scalar coefficient $\bm{g}_{i\leftarrow j}$,

$$\Delta\bm{v}_{i\leftarrow j} \;=\; \bm{g}_{i\leftarrow j}\odot\bm{v}_j. \tag{22}$$

The bond-adapter updates are then

$$\Delta\bm{H}_i \;=\; \frac{1}{|\mathcal{N}(i)|}\sum_{j\in\mathcal{N}(i)}\Delta\bm{h}_{i\leftarrow j}, \qquad \Delta\bm{V}_i \;=\; \frac{1}{|\mathcal{N}(i)|}\sum_{j\in\mathcal{N}(i)}\Delta\bm{v}_{i\leftarrow j}, \tag{23}$$

where $\mathcal{N}(i)=\{j:(i,j)\in\mathcal{E}\}$ and the scalar gate $\bm{g}_{i\leftarrow j}$ is broadcast over the spatial axis of $\bm{v}_j$. Because $\Delta\bm{h}_{i\leftarrow j}$ and $\bm{g}_{i\leftarrow j}$ are computed only from invariant quantities, they are E(3)-invariant. Because $\bm{v}_j$ is equivariant and the gate acts only on the channel axis, $\Delta\bm{v}_{i\leftarrow j}$ and $\Delta\bm{V}_i$ are equivariant as well. Each atom typically has at most a few incident edges (for example, $\leq 4$ covalent bonds), so the adapter has complexity $\mathcal{O}(|\mathcal{E}|)=\mathcal{O}(N)$ with a small constant factor.
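The sparse update in Equations (21) to (23) amounts to a gather over edges followed by a scatter-mean onto target atoms. The NumPy sketch below illustrates this under several assumptions that are not taken from the paper: $f_\theta$ is a two-layer MLP with SiLU and no biases, edges are stored as `(target, source)` index pairs in both directions, and atoms without any bond receive a zero update. Function and variable names are hypothetical.

```python
import numpy as np

def make_mlp(rng, d_in, d_hidden, d_out):
    """Hypothetical stand-in for f_theta: two linear maps with SiLU in between."""
    W1 = rng.normal(size=(d_in, d_hidden)) / np.sqrt(d_in)
    W2 = rng.normal(size=(d_hidden, d_out)) / np.sqrt(d_hidden)
    def f(x):
        h = x @ W1
        return (h / (1.0 + np.exp(-h))) @ W2
    return f

def bond_adapter(H, V, edges, E_attr, f_theta):
    """H: (N, d), V: (N, 3, d), edges: (M, 2) rows (i, j) = (target, source), E_attr: (M, d_e)."""
    N, d = H.shape
    i, j = edges[:, 0], edges[:, 1]
    m = f_theta(np.concatenate([H[i], H[j], E_attr], axis=1))  # Eq. (21): (M, 2d)
    dh, g = m[:, :d], m[:, d:]                                 # scalar message, gate
    dv = g[:, None, :] * V[j]                                  # Eq. (22): gate broadcast over space
    dH = np.zeros((N, d))
    dV = np.zeros((N, 3, d))
    np.add.at(dH, i, dh)                                       # scatter-sum onto targets
    np.add.at(dV, i, dv)
    deg = np.maximum(np.bincount(i, minlength=N), 1)           # avoid 0/0 for isolated atoms
    return dH / deg[:, None], dV / deg[:, None, None]          # Eq. (23): mean over N(i)

def random_rotation(rng):
    Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    if np.linalg.det(Q) < 0:
        Q[:, 0] *= -1.0
    return Q

rng = np.random.default_rng(0)
N, d, d_e = 4, 3, 2   # hypothetical sizes; atom 3 has no bonds
edges = np.array([[0, 1], [1, 0], [1, 2], [2, 1]])
E_attr = rng.normal(size=(len(edges), d_e))
f_theta = make_mlp(rng, 2 * d + d_e, 8, 2 * d)
H = rng.normal(size=(N, d))
V = rng.normal(size=(N, 3, d))

dH, dV = bond_adapter(H, V, edges, E_attr, f_theta)

R = random_rotation(rng)
dH2, dV2 = bond_adapter(H, np.einsum("ab,nbc->nac", R, V), edges, E_attr, f_theta)

print(np.allclose(dH2, dH))
print(np.allclose(dV2, np.einsum("ab,nbc->nac", R, dV)))
```

Work and memory here scale with the number of edges `M` rather than with `N * N`, which is the point of keeping bond information out of the dense attention path.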

In practice, the adapter can be inserted at the beginning of every block or only in the first block, or disabled altogether when no bond edge set is provided\. This flexibility lets the same backbone be reused across tasks that differ in whether explicit chemical adjacency is available\.

### 2\.7Space Complexity Analysis

We analyze the space (memory) complexity of Meet for a single structure with $N$ atoms, model dimension $d$, number of attention heads $H$ (head dimension $d_{h}=d/H$), $L$ Transformer blocks, and $|\mathcal{E}|$ edges. Following standard practice, we set the feed-forward hidden dimension $d_{\text{ff}}=\Theta(d)$, so that all width-dependent terms can be expressed in $d$ alone. We separately account for *parameter memory* (model weights) and *activation memory* (intermediate tensors retained during inference or, more critically, during training). For batched inputs with per-sample lengths $\{L_{b}\}_{b=1}^{B}$, $N$ should be replaced by $\sum_{b}L_{b}$ in the activation terms.

#### Parameter memory\.

Model weights are independent of $N$ and determined only by $L$ and $d$.

- Feature initialization $\mathcal{I}$. Scalar query/key projections contribute $\mathcal{O}(d^{2})$, and the output projection from head space to $d$ channels contributes $\mathcal{O}(Hd)=\mathcal{O}(d^{2})$. One-time cost is $\mathcal{O}(d^{2})$.
- Self-attention layers. Each block has independent scalar $\bm{Q}/\bm{K}/\bm{U}$ projections of size $\mathcal{O}(d^{2})$, independent bias-free vector $\bm{Q}/\bm{K}/\bm{U}$ projections of size $\mathcal{O}(d^{2})$, two output projections of size $\mathcal{O}(d^{2})$, and $\mathcal{O}(H)$ learnable distance scales. Per-layer cost is $\mathcal{O}(d^{2})$.
- Feed-forward layers. The scalar and vector branches each maintain input and output projections of size $\mathcal{O}(d\cdot d_{\text{ff}})=\mathcal{O}(d^{2})$. Per-layer cost is $\mathcal{O}(d^{2})$.
- Bond adapter. A small MLP maps $2d+d_{e}$ inputs to $2d$ outputs, with cost $\mathcal{O}(d^{2})$ in any block where the adapter is enabled.

Summing over components, the total parameter memory is

$$\underbrace{\mathcal{O}(d^{2})}_{\text{init}} \;+\; \underbrace{\mathcal{O}(Ld^{2})}_{\text{blocks}} \;=\; \mathcal{O}(Ld^{2}). \tag{24}$$

#### Activation memory\.

- Feature initialization. One round of distance-aware attention has the same activation profile as a single attention layer, giving $\mathcal{O}(N^{2}H)$ under a naive kernel, or $\mathcal{O}(NH)=\mathcal{O}(Nd)$ with a memory-efficient attention kernel. This is a one-time cost dominated by the $L$-layer backbone below.
- Self-attention. A naive implementation materializes the $N\times N$ logit matrix per head, requiring $\mathcal{O}(N^{2}H)$ memory per layer. With a memory-efficient attention kernel the full matrix is never stored, and only running softmax statistics and tile-sized buffers are maintained, reducing the per-layer cost to $\mathcal{O}(Nd)$. The distance-aware formulation appends only $9$ extra entries to queries and keys through the encoding in Equation [7](https://arxiv.org/html/2606.25006#S2.E7), so the effective head dimension increases from $d_{h}$ to $d_{h}+9$ but the asymptotic class is unchanged. Crucially, no separate $N\times N$ distance-bias tensor is required.
- Feature streams and feed-forward layer. Each block maintains $\bm{H}\in\mathbb{R}^{N\times d}$ and $\bm{V}\in\mathbb{R}^{N\times 3\times d}$, together occupying $\mathcal{O}(Nd)$, plus $\mathcal{O}(Nd)$ for the feed-forward intermediates. The peak memory across $L$ layers depends on the checkpointing strategy. Without checkpointing it is $\mathcal{O}(LNd)$, with $\sqrt{L}$-interval checkpointing it reduces to $\mathcal{O}(\sqrt{L}\,Nd)$ at the cost of one extra forward pass, and full recomputation stores only $\mathcal{O}(Nd)$ activations at inference time.
- Bond adapter. Message passing over a sparse bond edge set stores $\mathcal{O}(d)$ intermediate features per edge, giving $\mathcal{O}(|\mathcal{E}|d)$. Under the chemical valence assumption $|\mathcal{E}|\leq cN$, this simplifies to $\mathcal{O}(Nd)$ and is dominated by the feature-stream cost.
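To make the "running softmax statistics and tile-sized buffers" idea concrete, the sketch below compares a naive single-head attention with a key-tiled variant that uses an online softmax. It is an illustration under simplifying assumptions (one head, no $1/\sqrt{d_h}$ scaling, tiling over keys only), not the fused kernel used by the authors; the function names are hypothetical.

```python
import numpy as np

def naive_attention(q, k, u):
    logits = q @ k.T                               # materializes the full (N, N) matrix
    w = np.exp(logits - logits.max(axis=1, keepdims=True))
    return (w / w.sum(axis=1, keepdims=True)) @ u

def tiled_attention(q, k, u, tile=16):
    n, dv = q.shape[0], u.shape[1]
    run_max = np.full(n, -np.inf)                  # running row-wise max of the logits
    run_sum = np.zeros(n)                          # running softmax denominator
    acc = np.zeros((n, dv))                        # running unnormalized output
    for s in range(0, k.shape[0], tile):
        logits = q @ k[s:s + tile].T               # only an (N, tile) buffer at a time
        new_max = np.maximum(run_max, logits.max(axis=1))
        scale = np.exp(run_max - new_max)          # rescale previously accumulated terms
        p = np.exp(logits - new_max[:, None])
        run_sum = run_sum * scale + p.sum(axis=1)
        acc = acc * scale[:, None] + p @ u[s:s + tile]
        run_max = new_max
    return acc / run_sum[:, None]

rng = np.random.default_rng(0)
q = rng.normal(size=(50, 8))
k = rng.normal(size=(50, 8))
u = rng.normal(size=(50, 4))
out_naive = naive_attention(q, k, u)
out_tiled = tiled_attention(q, k, u, tile=16)      # same result, no (N, N) tensor
```

Production kernels also tile the query axis, so the working buffer is independent of $N$; the key-only tiling here is kept for brevity.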

[Table 1](https://arxiv.org/html/2606.25006#S2.T1) summarizes these results. By (i) using memory-efficient attention kernels and (ii) encoding distance information through query–key augmentation rather than through an explicit bias matrix, Meet avoids any $\mathcal{O}(N^{2})$ activation term. The overall peak activation memory therefore scales *linearly* in $N$ for a fixed model size, making the architecture applicable to large molecular structures.
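Why query–key augmentation removes the need for an explicit bias matrix can be seen from the identity $\|\bm{x}_i-\bm{x}_j\|^2=\|\bm{x}_i\|^2+\|\bm{x}_j\|^2-2\,\bm{x}_i\cdot\bm{x}_j$. The sketch below is not the paper's Equation 7 (which appends 9 entries and is not reproduced in this section). It is a simpler 5-entry variant, assuming a squared-distance penalty with a hypothetical scale `gamma`, that shows how a distance term can be folded into a plain dot product.

```python
import numpy as np

def augment_qk(q, k, x, gamma):
    """Append 5 coordinate-derived entries so that
    q_aug[i] . k_aug[j] = q[i] . k[j] - gamma * ||x[i] - x[j]||^2.
    Illustrative encoding only; the paper's Eq. 7 uses a different, 9-entry form."""
    sq = (x ** 2).sum(axis=1, keepdims=True)       # ||x_i||^2, shape (N, 1)
    ones = np.ones_like(sq)
    q_aug = np.concatenate([q, 2.0 * gamma * x, -gamma * sq, ones], axis=1)
    k_aug = np.concatenate([k, x, ones, -gamma * sq], axis=1)
    return q_aug, k_aug

rng = np.random.default_rng(0)
n, d_h, gamma = 32, 8, 0.1                         # hypothetical sizes and scale
q = rng.normal(size=(n, d_h))
k = rng.normal(size=(n, d_h))
x = rng.normal(size=(n, 3)) * 5.0
q_aug, k_aug = augment_qk(q, k, x, gamma)

# Reference: logits with an explicitly materialized (N, N) distance bias.
dist2 = ((x[:, None, :] - x[None, :, :]) ** 2).sum(axis=-1)
explicit = q @ k.T - gamma * dist2
```

Since the augmented logits are ordinary dot products, they can be passed to any memory-efficient attention kernel without ever forming the pairwise distance tensor.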

Table 1: Space complexity of Meet ($d_{\text{ff}}=\Theta(d)$), where $N$ is the number of atoms, $d$ is the model dimension, $L$ is the number of Transformer blocks, and $|\mathcal{E}|$ is the number of edges with $|\mathcal{E}|=\mathcal{O}(N)$. Activation costs assume no checkpointing, with variants discussed in the text.

## 3Experiments

### 3\.1Datasets

We constructed our training set using a pipeline similar to that of the previously reported CPSea dataset, while extending its scope to linear peptides and adjusting several filtering thresholds\.

Briefly, we use AFDB [varadi2022alphafold] domains as source structures. For each protein structure, we first compute secondary structure assignments, pLDDT scores, GRAVY scores, hydrophobicity annotations, and the residue-level C-distance matrix. We then apply sliding windows of 3–13 residues to enumerate candidate segments and retain those with average GRAVY $<0.5$, hydrophobic residue ratio $<0.45$, helix ratio $<0.67$, sheet ratio $<0.34$, minimum pLDDT $>70$, and terminal amide C-distance within $3.5$–$15.5$ Å.
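The sequence-level part of this filter can be sketched as below. The thresholds are the ones quoted above; everything else is an assumption made for illustration: the function name, the DSSP-like `H`/`E` labels, the set of residues counted as hydrophobic, and the use of the Kyte-Doolittle scale for GRAVY. The terminal-distance check is omitted because it needs coordinates.

```python
# Kyte-Doolittle hydropathy values; GRAVY is their mean over a segment.
KD = {"A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5, "E": -3.5,
      "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8,
      "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2}
HYDROPHOBIC = set("AVILMFWC")  # assumed set; the paper does not list one

def candidate_segments(seq, ss, plddt, min_len=3, max_len=13):
    """Yield (start, end) windows that pass the sequence-level thresholds.

    seq: amino-acid string; ss: per-residue labels ('H' helix, 'E' sheet);
    plddt: per-residue confidence scores.
    """
    for length in range(min_len, max_len + 1):
        for start in range(len(seq) - length + 1):
            end = start + length
            win, win_ss = seq[start:end], ss[start:end]
            gravy = sum(KD[a] for a in win) / length
            hydrophobic = sum(a in HYDROPHOBIC for a in win) / length
            helix = win_ss.count("H") / length
            sheet = win_ss.count("E") / length
            if (gravy < 0.5 and hydrophobic < 0.45 and helix < 0.67
                    and sheet < 0.34 and min(plddt[start:end]) > 70):
                yield start, end

# Toy example: a 9-residue polar coil with high confidence.
seq, ss, plddt = "GSDKTGSEN", "C" * 9, [90.0] * 9
segments = list(candidate_segments(seq, ss, plddt))
```

For a 9-residue input, windows of length 3 to 9 are enumerated, so at most 28 segments can be returned.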

The resulting candidates are further filtered using interface-based criteria derived from buried surface area (BSA), requiring total BSA $>400\,\text{Å}^{2}$, relative BSA between $0.35$ and $0.85$, relative apolar BSA $<0.75$, and limited burial of the two terminal capping side chains ($<0.30$ each). In addition, the receptor neighborhood is required to form a connected structural graph under a residue-level connectivity cutoff of $9.0$ Å. Candidates that pass all filters are finally clustered using a sequence-overlap threshold of $0.2$.

Starting from $8.64$ million AFDB domains, this pipeline produces approximately $100$ million candidate segments and $31$ million clusters of peptide–protein complex structures. From these, we randomly sample $100\mathrm{K}$ and $1.2\mathrm{M}$ structures for the training runs reported below.

### 3\.2Memory Efficiency

We first evaluate the memory footprint of Meet empirically and examine whether the implementation follows the linear scaling predicted by the analysis in Table [1](https://arxiv.org/html/2606.25006#S2.T1).

#### Setup\.

To avoid confounding the backbone comparison with variability in real complexes, we construct synthetic peptide chains of varying length at full-atom resolution. Each chain is composed of alanine residues. For a given residue count $n_{\mathrm{aa}}$, we enumerate all heavy atoms in the alanine template, place residues along a linear backbone with $3.8$ Å spacing, add small Gaussian coordinate noise, and construct both intra-residue and inter-residue covalent bonds. We sweep $n_{\mathrm{aa}}$ over powers of two from $2$ to $1{,}024$, yielding atom counts from roughly $10$ to $5{,}000$. For each chain length, we record the peak GPU memory allocated during a single inference forward pass without gradient computation. Meet and the original EPT [jiao2025equivariantpretrainedtransformerunified] backbone are evaluated under the same model dimension and number of attention heads.
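A synthetic chain of this kind could be built roughly as follows. The five heavy atoms per alanine (N, CA, C, O, CB) match the atom counts quoted above; the local offsets, the noise level, and the function name are placeholders chosen for illustration, not the template used in the paper.

```python
import numpy as np

ALA_ATOMS = ["N", "CA", "C", "O", "CB"]
# Hypothetical local offsets (in Angstrom) from the residue anchor; not a real alanine template.
ALA_OFFSETS = np.array([[-1.2, 0.3, 0.0], [0.0, 0.0, 0.0], [1.2, 0.4, 0.0],
                        [1.5, 1.6, 0.0], [0.0, -0.9, 1.2]])
ALA_BONDS = [(0, 1), (1, 2), (2, 3), (1, 4)]       # N-CA, CA-C, C=O, CA-CB

def make_poly_ala(n_aa, spacing=3.8, noise_std=0.05, seed=0):
    """Return (coords, bonds) for a straight poly-alanine chain with n_aa residues."""
    rng = np.random.default_rng(seed)
    k = len(ALA_ATOMS)
    # One anchor per residue, placed along the x axis with the given spacing.
    anchors = np.arange(n_aa)[:, None] * np.array([spacing, 0.0, 0.0])
    coords = (anchors[:, None, :] + ALA_OFFSETS[None]).reshape(-1, 3)
    coords = coords + rng.normal(scale=noise_std, size=coords.shape)
    # Intra-residue bonds, then the peptide bond C(r) - N(r+1) between neighbors.
    bonds = [(r * k + a, r * k + b) for r in range(n_aa) for a, b in ALA_BONDS]
    bonds += [(r * k + 2, (r + 1) * k) for r in range(n_aa - 1)]
    return coords, np.array(bonds)

coords, bonds = make_poly_ala(8)
atom_counts = [make_poly_ala(n)[0].shape[0] for n in (2, 1024)]
```

With five atoms per residue, the sweep from 2 to 1,024 residues spans 10 to 5,120 atoms, and each chain has $5n_{\mathrm{aa}}-1$ bonds, so the edge count stays linear in $N$.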

#### Overall comparison with EPT\.

Figure [2](https://arxiv.org/html/2606.25006#S3.F2) (left) reports peak allocated memory as a function of peptide length on a log–log scale. Across the entire range, Meet consistently uses less memory than EPT, and the gap widens as the chain becomes longer. At $n_{\mathrm{aa}}=1{,}024$, Meet uses roughly $300$ MB whereas EPT exceeds $1{,}500$ MB, a reduction of approximately $5\times$. The Meet curve remains close to linear, matching the $\mathcal{O}(LNd)$ activation scaling derived in Table [1](https://arxiv.org/html/2606.25006#S2.T1). By contrast, EPT follows a visibly steeper trend, consistent with the $\mathcal{O}(N^{2})$ memory cost of explicitly materializing pairwise distance-bias tensors.
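One way to read a log–log plot like this is to fit the slope of $\log(\text{memory})$ against $\log N$: a slope near 1 indicates linear scaling, and a slope drifting toward 2 indicates a quadratic term. The sketch below uses synthetic footprints with made-up constants, not measurements from the paper, purely to show the diagnostic.

```python
import numpy as np

def loglog_slope(n, mem):
    """Least-squares slope of log(mem) versus log(n)."""
    slope, _ = np.polyfit(np.log(n), np.log(mem), deg=1)
    return slope

n = 5 * 2 ** np.arange(7, 11)             # atom counts for n_aa = 128, 256, 512, 1024
linear = 6e4 * n                          # hypothetical O(N) footprint in bytes
quadratic = 6e4 * n + 50.0 * n ** 2       # same footprint plus a hypothetical O(N^2) term

slope_linear = loglog_slope(n, linear)
slope_quadratic = loglog_slope(n, quadratic)
```

On these synthetic curves the first slope is 1 up to rounding, while the second is clearly larger, which is the qualitative difference described for Meet and EPT.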

#### Per\-module breakdown\.

Figure [2](https://arxiv.org/html/2606.25006#S3.F2) (right) further decomposes the memory use of Meet into its four main components, namely FFN, self-attention, bond adapter, and initialization. All four curves grow approximately linearly as $n_{\mathrm{aa}}$ increases from $128$ to $1{,}024$, indicating that none of the modules introduces a hidden quadratic term. The FFN and self-attention layers account for most of the footprint, as expected from their $\mathcal{O}(Nd)$ activation tensors. The initialization and bond-adapter modules contribute smaller but still linear overheads.

![Refer to caption](https://arxiv.org/html/2606.25006v1/figs/ept_meet_infer_peak_alloc.png)

![Refer to caption](https://arxiv.org/html/2606.25006v1/figs/meet_alloc_each_module.png)

Figure 2: Memory efficiency of Meet. *(Left)* Inference peak memory versus peptide length for Meet and EPT. Meet scales linearly, whereas EPT grows super-linearly. *(Right)* Per-module memory breakdown of Meet. All four components grow approximately linearly with sequence length.

### 3\.3Benchmark Results

We next compare Meet with existing peptide design methods on the $100\mathrm{K}$ dataset described in Section [3.1](https://arxiv.org/html/2606.25006#S3.SS1).

#### Baselines and model variants\.

We consider four recent generative baselines, PepGLAD [kong2024pepglad], PepFlow [li2024pepflow], UniMoMo [kong2025unimomo], and DiffPepBuilder [wang2024diffpepbuilder]. For our approach, we evaluate two variants, MEET-XS (extra-small) and MEET-B (base), so that the effect of backbone capacity can be assessed at a fixed data scale.

#### Evaluation metrics\.

We emphasize two metrics that capture complementary aspects of design quality. The first is $\Delta G$, the predicted binding free energy, where lower values indicate stronger predicted binding. The second is PoseBuster [buttenschoen2024posebusters] pass rate (PB), which measures the physical validity of generated poses and is better when higher. We also report shape complementarity (Shape), solvation-normalized binding energy ($\Delta G/\Delta\mathrm{SASA}$), and sequence diversity (Seq. Div.).

#### Results\.

Table [2](https://arxiv.org/html/2606.25006#S3.T2) presents the benchmark results. Both MEET variants outperform all baselines on the two primary metrics. MEET-B achieves a mean $\Delta G$ of $-27.40$ and a PoseBuster pass rate of $0.799$, compared with $-21.80$ and $0.561$ for the strongest baseline, UniMoMo. PepGLAD and PepFlow do not pass the PoseBuster checks in this evaluation, and DiffPepBuilder reaches only $0.001$. Even the smaller MEET-XS already surpasses every baseline, with a mean $\Delta G$ of $-25.67$ and a PB of $0.660$. The same pattern is reflected in Shape and $\Delta G/\Delta\mathrm{SASA}$, where MEET-B obtains the best values among all methods.

The main trade-off at this scale appears in sequence diversity. MEET-B has lower Seq. Div. ($0.715$) than PepGLAD ($0.939$) and UniMoMo ($0.922$), suggesting that the higher-capacity model samples a more concentrated distribution when trained on $100\mathrm{K}$ examples. Rather than treating this as an architectural limitation, we examine in the next subsection whether the effect changes with substantially more training data.

Table 2: Benchmark results on the $100\mathrm{K}$ training set. Best values are bolded. Lower $\Delta G$ and higher PB are preferred.

### 3\.4Scaling to a Larger Training Set

We then investigate how the system behaves when both the data scale and the capacity of the latent denoiser are increased. In this experiment, the VAE is kept fixed, and the Meet denoising network inside the LDM is scaled on the $1.2\mathrm{M}$ dataset described in Section [3.1](https://arxiv.org/html/2606.25006#S3.SS1).

#### Model configurations\.

We follow the depth and hidden-dimension conventions of DiT [peebles2023scalable], using S, B, and L variants for the LDM denoising backbone. We also include an XS variant with half the S-model depth.

#### Training dynamics\.

Figure [3](https://arxiv.org/html/2606.25006#S3.F3) shows training and validation loss curves over $100\mathrm{K}$ optimization steps. Larger backbones consistently reach lower loss, and the ordering MEET-L $<$ MEET-B $<$ MEET-S $<$ MEET-XS is maintained on both splits throughout training. This trend indicates that the memory-efficient design of Meet does not prevent the denoiser from exploiting additional capacity. Instead, the backbone remains effective as the LDM is scaled to larger models.

![Refer to caption](https://arxiv.org/html/2606.25006v1/figs/loss_train_1250k.png)

![Refer to caption](https://arxiv.org/html/2606.25006v1/figs/loss_val_1250k.png)

Figure 3: Training *(left)* and validation *(right)* loss curves on the $1.2\mathrm{M}$ dataset. Larger LDM backbones consistently achieve lower loss, showing favorable scaling behavior.

#### Generation quality\.

Table [3](https://arxiv.org/html/2606.25006#S3.T3) reports downstream generation metrics for all four model sizes. As capacity increases from XS to L, mean $\Delta G$ improves from $-26.26$ to $-28.22$ and PB increases from $0.703$ to $0.732$. Shape complementarity and solvation efficiency follow the same trend. The effect of data scale is also clear, as MEET-XS trained on $1.2\mathrm{M}$ already matches or exceeds the $100\mathrm{K}$ MEET-B setting across most metrics.

Importantly, sequence diversity improves substantially at the larger data scale. MEET-L retains a Seq. Div. of $0.899$, compared with $0.715$ for MEET-B on $100\mathrm{K}$. This suggests that the reduced diversity observed in Table [2](https://arxiv.org/html/2606.25006#S3.T2) is primarily a data-scale effect rather than an inherent consequence of the Meet backbone.

Table 3: Generation quality on the $1.2\mathrm{M}$ training set across LDM backbone scales. Best results are bolded.

Taken together, these results show that the memory-efficient architecture of Meet enables practical scaling along both the model-capacity and data axes, and that this scaling translates into consistent improvements in peptide design quality.

## 4Conclusion

We introduced Meet, a memory-efficient E(3)-equivariant Transformer backbone for full-atom peptide design. Its main design principle is to express geometric computation in forms compatible with memory-efficient attention, including distance-aware query–key augmentation, global vector initialization, and sparse bond adaptation. This removes the quadratic activation bottleneck of dense geometric attention while preserving coupled invariant and equivariant feature streams.

Integrated into a VAE and latent diffusion framework, Meet improves both efficiency and generation quality. The backbone shows linear memory scaling in peptide length, outperforms prior peptide design methods on binding affinity and physical validity, and supports systematic scaling of the latent denoiser on larger training data. These results indicate that backbone efficiency is not only an implementation concern, but a key enabler for scaling full-atom generative models.

## References


## 5Introduction of Latent Generative Framework

We instantiate Meet in a two-stage latent generative framework for target-specific full-atom peptide design. The overall design follows the motivation of latent generative modeling in UniMoMo [kong2025unimomo], where detailed atomistic structures are first compressed into a compact latent representation and conditional generation is then performed in this lower-dimensional space. This separation is useful for peptide design because the atomic system contains many local constraints, while the global design decision is naturally organized at the residue or block level.

#### Atom\-level VAE\.

For a peptide–protein complex, let $\bm{X}\in\mathbb{R}^{N\times 3}$ denote atom coordinates, let $A_{i}$ denote atom types, and let $S_{b}$ denote the block type of residue or fragment $b$. The VAE maps the full-atom complex into block-level latent variables

$$\bm{Z}=(\bm{Z}_{H},\bm{Z}_{X}),\qquad \bm{Z}_{H}\in\mathbb{R}^{M\times d_{z}},\qquad \bm{Z}_{X}\in\mathbb{R}^{M\times 3}, \tag{25}$$

where $M$ is the number of blocks. The invariant latent $\bm{Z}_{H}$ captures block identity and local chemical context, whereas the equivariant latent $\bm{Z}_{X}$ records block-level geometry. The encoder first computes atom-level scalar and vector features with Meet, then aggregates atom features within each block to parameterize an approximate posterior

$$q_{\phi}(\bm{Z}\mid\bm{X},A,S)=\prod_{b=1}^{M}q_{\phi}(\bm{Z}_{H,b}\mid\bm{X},A,S)\,q_{\phi}(\bm{Z}_{X,b}\mid\bm{X},A,S). \tag{26}$$

We regularize the invariant latent toward a standard Gaussian prior and the coordinate latent toward a Gaussian centered at the corresponding block center $\bm{r}_{b}$,

$$p(\bm{Z})=\prod_{b=1}^{M}\mathcal{N}(\bm{Z}_{H,b};\bm{0},\bm{I})\,\mathcal{N}(\bm{Z}_{X,b};\bm{r}_{b},\bm{I}). \tag{27}$$

This prior encourages each block latent point to remain close to the full-atom geometry it represents, while still allowing the latent diffusion model to later model global peptide placement. In our framework, we do not introduce separate learned block or chain embeddings. The block abstraction is instead induced by atom-level encoding, block-wise pooling, position information, and the latent variables themselves.
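The two KL regularizers implied by Equation 27 have a closed form if the posterior is a diagonal Gaussian. That posterior family is an assumption here, since the section does not state it, and the function and variable names are illustrative. The only difference between the two terms is the prior mean: zero for $\bm{Z}_{H}$ and the block center $\bm{r}_{b}$ for $\bm{Z}_{X}$.

```python
import numpy as np

def kl_diag_gaussian(mu, log_var, prior_mean):
    """KL( N(mu, diag(exp(log_var))) || N(prior_mean, I) ), summed over the last axis."""
    return 0.5 * np.sum(np.exp(log_var) + (mu - prior_mean) ** 2 - 1.0 - log_var, axis=-1)

M, d_z = 4, 8                                     # hypothetical block count and latent size
rng = np.random.default_rng(0)
r = rng.normal(size=(M, 3)) * 10.0                # block centers r_b

# Coordinate latent sitting exactly on the block centers with unit variance.
kl_x_matched = kl_diag_gaussian(r, np.zeros((M, 3)), r)
# Same posterior shifted by 1 Angstrom along x: each block pays 0.5 * 1^2.
kl_x_shifted = kl_diag_gaussian(r + np.array([1.0, 0.0, 0.0]), np.zeros((M, 3)), r)
# Invariant latent against the standard normal prior.
kl_h = kl_diag_gaussian(np.zeros((M, d_z)), np.zeros((M, d_z)), 0.0)
```

Centering the coordinate prior at $\bm{r}_{b}$ means the penalty depends only on the offset from the block center, not on where the complex sits in space.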

The decoder factorizes reconstruction into sequence prediction and coordinate generation. The sequence decoder predicts block types from the latent point cloud,

$$p_{\xi}(S\mid\bm{Z})=\prod_{b=1}^{M}p_{\xi}(S_{b}\mid\bm{Z}_{H},\bm{Z}_{X}), \tag{28}$$

and is trained with cross entropy. Given the decoded or ground-truth block types, the structure decoder reconstructs atom coordinates under a bond graph specified by the target topology, peptide chain adjacency, predicted block identities, and terminal capping rules, avoiding a separate bond-generation distribution.

Coordinate reconstruction is trained as a continuous flow-matching problem [lipman2023flowmatchinggenerativemodeling]. For each atom, an initial coordinate $\bm{X}_{\mathrm{prior}}$ is sampled around the corresponding block latent coordinate, and a time $t\sim\mathcal{U}(0,1)$ is drawn. We define the interpolation

$$\bm{X}_{t}=\bm{X}_{\mathrm{prior}}+(1-t)(\bm{X}-\bm{X}_{\mathrm{prior}}),\qquad \bm{u}^{\star}=\bm{X}-\bm{X}_{\mathrm{prior}}, \tag{29}$$

so that $t=1$ corresponds to the prior state and $t=0$ corresponds to the data structure. The structure decoder predicts a vector field $\bm{u}_{\xi}(\bm{X}_{t},t,\bm{Z},S)$ and is trained to match $\bm{u}^{\star}$. The full VAE objective can be written as

$$\begin{aligned}
\mathcal{L}_{\mathrm{VAE}} &= \lambda_{\mathrm{seq}}\,\mathrm{CE}\!\left(S,\hat{S}\right)+\lambda_{\mathrm{coord}}^{\mathrm{lig}}\,\bigl\|\bm{M}_{\mathrm{lig}}\odot(\bm{u}_{\xi}-\bm{u}^{\star})\bigr\|_{2}^{2}+\lambda_{\mathrm{coord}}^{\mathrm{poc}}\,\bigl\|\bm{M}_{\mathrm{poc}}\odot(\bm{u}_{\xi}-\bm{u}^{\star})\bigr\|_{2}^{2}\\
&\quad+\lambda_{\mathrm{dist}}\,\mathcal{L}_{\mathrm{dist}}+\lambda_{H}\,D_{\mathrm{KL}}\!\left(q_{\phi}(\bm{Z}_{H}\mid\bm{X},A,S)\,\|\,p(\bm{Z}_{H})\right)+\lambda_{X}\,D_{\mathrm{KL}}\!\left(q_{\phi}(\bm{Z}_{X}\mid\bm{X},A,S)\,\|\,p(\bm{Z}_{X})\right).
\end{aligned} \tag{30}$$

Here $\bm{M}_{\mathrm{lig}}$ and $\bm{M}_{\mathrm{poc}}$ select ligand and pocket atoms, and $\mathcal{L}_{\mathrm{dist}}$ preserves local geometric distances. During VAE decoding, the coordinates are initialized from $\bm{X}_{\mathrm{prior}}$ and updated along a decreasing time grid $1=t_{K}>\cdots>t_{0}=0$ by

$$\bm{X}_{t_{k-1}}=\bm{X}_{t_{k}}+(t_{k}-t_{k-1})\,\bm{u}_{\xi}(\bm{X}_{t_{k}},t_{k},\bm{Z},\hat{S}), \tag{31}$$

which transports atoms from the latent prior toward a full-atom structure.
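The time convention in Equations 29 and 31 (prior at $t=1$, data at $t=0$) can be checked with a small sketch. It substitutes the oracle target $\bm{u}^{\star}$ for the learned field $\bm{u}_{\xi}$, so it only verifies that the interpolation and the Euler update are consistent with each other. The function names and toy sizes are assumptions; the 10 steps and unit prior standard deviation follow Table G.1.

```python
import numpy as np

def interpolate(x, x_prior, t):
    """Eq. 29: X_t on the straight path, with t = 1 at the prior and t = 0 at the data."""
    return x_prior + (1.0 - t) * (x - x_prior)

def euler_decode(x_prior, field, steps=10):
    """Eq. 31: integrate from t_K = 1 down to t_0 = 0 with uniform Euler steps."""
    ts = np.linspace(1.0, 0.0, steps + 1)
    x = x_prior.copy()
    for t_hi, t_lo in zip(ts[:-1], ts[1:]):
        x = x + (t_hi - t_lo) * field(x, t_hi)    # positive step size times the field
    return x

rng = np.random.default_rng(0)
x_data = rng.normal(size=(20, 3)) * 3.0           # toy "ground-truth" coordinates
x_prior = rng.normal(size=(20, 3))                # prior sample with unit std
u_star = x_data - x_prior                         # regression target in Eq. 29

oracle_field = lambda x, t: u_star                # stand-in for the trained decoder u_xi
x_decoded = euler_decode(x_prior, oracle_field, steps=10)
```

Because the target field is constant along the straight path, Euler integration with the oracle recovers the data exactly for any step count; with a learned field, the step count trades accuracy for cost.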

#### Latent diffusion model\.

After VAE training, the autoencoder is frozen and a conditional latent diffusion model is trained on the VAE latents. Let $\bm{C}$ denote the target-pocket context encoded by the frozen VAE, and let $\bm{Z}_{0}=(\bm{Z}_{H,0},\bm{Z}_{X,0})$ denote the clean peptide latents. The coordinate latent is centered by the pocket center and normalized by a fixed coordinate scale before diffusion training.

Both $\bm{Z}_{H}$ and $\bm{Z}_{X}$ are modeled with continuous-time cosine diffusion paths inspired by DDPM formulations [ho2020denoisingdiffusionprobabilisticmodels, nichol2021improveddenoisingdiffusionprobabilistic]. For each latent field $r\in\{H,X\}$, the forward process samples

$$\bm{Z}_{r,t}=\eta(t)\bm{Z}_{r,0}+\sigma(t)\bm{\epsilon}_{r},\qquad t\sim\mathcal{U}(0,1),\qquad \bm{\epsilon}_{r}\sim\mathcal{N}(\bm{0},\bm{I}), \tag{32}$$

where $\eta(t)$ and $\sigma(t)$ are the cosine schedule coefficients. The denoising network $\bm{\epsilon}_{\theta}$ is a Meet backbone conditioned on $\bm{C}$ and trained to predict the injected noise. The LDM objective is

$$\mathcal{L}_{\mathrm{LDM}}=\mathbb{E}_{t,\bm{Z}_{0},\bm{\epsilon}_{H},\bm{\epsilon}_{X}}\left[\left\|\bm{\epsilon}_{\theta}^{H}(\bm{Z}_{H,t},\bm{Z}_{X,t},t,\bm{C})-\bm{\epsilon}_{H}\right\|_{2}^{2}+\left\|\bm{\epsilon}_{\theta}^{X}(\bm{Z}_{H,t},\bm{Z}_{X,t},t,\bm{C})-\bm{\epsilon}_{X}\right\|_{2}^{2}\right]. \tag{33}$$

At inference time, the target pocket is first encoded into $\bm{C}$. Peptide latents are initialized from Gaussian noise at $t=1$, and the reverse trajectory is solved with the probability-flow ODE sampler [song2021scorebasedgenerativemodelingstochastic], denoted as DiffusionODE in our experiments. For a DDPM probability path, the corresponding forward SDE has drift and diffusion coefficients

$$f(\bm{Z}_{r,t},t)=-\frac{1}{2}\beta(t)\bm{Z}_{r,t},\qquad g(t)=\sqrt{\beta(t)}. \tag{34}$$

Since the denoiser predicts the noise, the score is estimated as

$$\bm{s}_{\theta}^{r}(\bm{Z}_{H,t},\bm{Z}_{X,t},t,\bm{C})=-\frac{\bm{\epsilon}_{\theta}^{r}(\bm{Z}_{H,t},\bm{Z}_{X,t},t,\bm{C})}{\sigma(t)},\qquad r\in\{H,X\}. \tag{35}$$

The probability-flow ODE is then

$$\frac{\mathrm{d}\bm{Z}_{r,t}}{\mathrm{d}t}=f(\bm{Z}_{r,t},t)-\frac{1}{2}g(t)^{2}\bm{s}_{\theta}^{r}(\bm{Z}_{H,t},\bm{Z}_{X,t},t,\bm{C}),\qquad r\in\{H,X\}. \tag{36}$$

Using a decreasing time grid $1=t_{K}>\cdots>t_{0}=0$, the sampler applies an Euler update

$$\bm{Z}_{r,t_{k-1}}=\bm{Z}_{r,t_{k}}+(t_{k-1}-t_{k})\left[f(\bm{Z}_{r,t_{k}},t_{k})-\frac{1}{2}g(t_{k})^{2}\bm{s}_{\theta}^{r}(\bm{Z}_{H,t_{k}},\bm{Z}_{X,t_{k}},t_{k},\bm{C})\right]. \tag{37}$$

On the final step, the solver returns the corresponding clean-latent prediction

$$\hat{\bm{Z}}_{r,0}=\frac{\bm{Z}_{r,t}-\sigma(t)\bm{\epsilon}_{\theta}^{r}(\bm{Z}_{H,t},\bm{Z}_{X,t},t,\bm{C})}{\eta(t)}. \tag{38}$$

This procedure produces peptide block latents at $t=0$. The frozen VAE decoder then predicts the peptide sequence, constructs the bond graph from the decoded blocks and peptide topology, initializes atom coordinates around $\bm{Z}_{X}$, and applies the coordinate flow decoder to generate the final full-atom peptide–protein complex. In this way, Meet is used in the encoder, the VAE decoders, and the latent denoiser, making the scalability of the backbone central to the efficiency of the entire pipeline.
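The relations between the forward process (Eq. 32), the score estimate (Eq. 35), and the clean-latent prediction (Eq. 38) can be sketched as follows. The exact cosine coefficients are not given in this section, so $\eta(t)=\cos(\pi t/2)$ and $\sigma(t)=\sin(\pi t/2)$ are assumed here as one common variance-preserving choice. The true noise is used in place of the network output, so the sketch checks algebraic consistency only; the full ODE integration of Eq. 37 is not reproduced.

```python
import numpy as np

def eta(t):
    return np.cos(0.5 * np.pi * t)        # assumed signal coefficient (cosine path)

def sigma(t):
    return np.sin(0.5 * np.pi * t)        # assumed noise coefficient, eta^2 + sigma^2 = 1

def forward_noise(z0, t, eps):
    return eta(t) * z0 + sigma(t) * eps   # Eq. 32

def score_from_eps(eps_pred, t):
    return -eps_pred / sigma(t)           # Eq. 35

def predict_clean(z_t, t, eps_pred):
    return (z_t - sigma(t) * eps_pred) / eta(t)   # Eq. 38

rng = np.random.default_rng(0)
z0 = rng.normal(size=(5, 8))              # toy clean latents (5 blocks, latent size 8)
eps = rng.normal(size=z0.shape)
t = 0.3
z_t = forward_noise(z0, t, eps)

z0_hat = predict_clean(z_t, t, eps)       # oracle noise stands in for epsilon_theta
score = score_from_eps(eps, t)
# Score of the conditional Gaussian N(eta * z0, sigma^2 I) evaluated at z_t.
score_reference = -(z_t - eta(t) * z0) / sigma(t) ** 2
```

Note that Equation 38 divides by $\eta(t)$, which vanishes at $t=1$ under this schedule, so the clean-latent formula is only well conditioned away from the pure-noise end of the path.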

## 6Evaluation Metrics

The benchmark tables report five evaluation metrics: predicted binding free energy $\Delta G$, PoseBuster pass rate (PB), shape complementarity (Shape), solvation-normalized binding energy $\Delta G/\Delta\mathrm{SASA}$, and sequence diversity (Seq. Div.). These metrics are chosen to jointly assess binding affinity, physical validity, interface packing, energetic efficiency, and sample diversity.

$\Delta G$. This metric measures the predicted binding free energy of the generated peptide against the target pocket. For each generated complex, the structure is first relaxed under coordinate constraints, and the peptide–target interface is then scored with a PyRosetta interface energy function [Chaudhury\_2010]. Lower $\Delta G$ indicates a more favorable predicted binding interaction. We report both the mean and the median across generated samples, since the mean reflects overall sample quality while the median is less sensitive to a small number of very strong or very weak designs.

PB. This metric measures the fraction of generated poses that pass PoseBuster [buttenschoen2024posebusters]. PoseBuster evaluates whether a generated peptide pose is chemically and geometrically plausible, including molecular parsing, bond geometry, internal clashes, flatness constraints, and intermolecular clashes with the target pocket. A sample is counted as valid only when all selected checks are passed. Higher PB therefore indicates that a method produces fewer physically implausible complexes.

Shape. This metric denotes the PyRosetta shape-complementarity score of the peptide–target interface after the same constrained relaxation used for interface scoring [Chaudhury\_2010]. It measures how well the molecular surfaces of the generated peptide and target pocket pack against each other. Higher Shape values indicate tighter and more geometrically complementary interfaces, independent of whether the peptide sequence matches the reference binder for that target.

$\Delta G/\Delta\mathrm{SASA}$. This metric normalizes the predicted binding free energy by the buried solvent-accessible surface area at the interface. It distinguishes designs that obtain favorable energy through efficient local interactions from those that rely mainly on forming a larger buried interface. Lower values indicate stronger predicted binding per unit buried area. As with $\Delta G$, we report both the mean and the median across generated samples for each method.

Seq. Div. This metric measures the diversity of generated peptide sequences for the same target. For each target pocket, we compare all pairs of generated sequences and compute their pairwise dissimilarity as one minus their position-wise amino-acid recovery. The score is averaged within each target and then averaged across targets. Higher Seq. Div. indicates that the model can propose a broader set of peptide sequences rather than repeatedly sampling near-identical binders.
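The Seq. Div. computation described above can be sketched in a few lines. The sketch assumes that all sequences generated for one target have equal length, so that position-wise recovery is well defined; how the paper handles unequal lengths is not stated. Function names and the toy data are illustrative.

```python
from itertools import combinations

def pairwise_dissimilarity(a, b):
    """One minus the fraction of positions where the two sequences agree."""
    if len(a) != len(b):
        raise ValueError("position-wise recovery assumes equal-length sequences")
    matches = sum(x == y for x, y in zip(a, b))
    return 1.0 - matches / len(a)

def target_diversity(seqs):
    """Mean dissimilarity over all unordered pairs generated for one target."""
    pairs = list(combinations(seqs, 2))
    return sum(pairwise_dissimilarity(a, b) for a, b in pairs) / len(pairs)

def sequence_diversity(samples_by_target):
    """Average within each target first, then across targets."""
    scores = [target_diversity(seqs) for seqs in samples_by_target.values()]
    return sum(scores) / len(scores)

# Toy example with two hypothetical targets.
demo = {
    "target_1": ["ACDE", "ACDF", "GCDE"],   # pair dissimilarities 0.25, 0.25, 0.5
    "target_2": ["AAAA", "AAAA"],           # identical samples, diversity 0
}
demo_diversity = sequence_diversity(demo)
```

Averaging per target before averaging across targets keeps targets with many samples from dominating the score.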

## 7Hyperparameter Settings

We list only the hyperparameters that determine model capacity, latent dimensionality, objective balance, and optimization across the reported runs\.

Table G.1: Key hyperparameters of the VAE.

| Name | Value | Description |
| --- | --- | --- |
| *Encoder* | | |
| Backbone | Meet | Full-atom encoder with bond-aware layers. |
| Depth | $6$ | Number of encoder layers. |
| Hidden size | $128$ | Scalar and vector channel dimension. |
| Attention heads | $8$ | Number of attention heads. |
| Latent size | $8$ | Dimension of the invariant block latent state. |
| *Sequence Decoder* | | |
| Backbone | Meet | Decoder for peptide block-type prediction. |
| Depth | $3$ | Number of decoder layers. |
| Hidden size | $128$ | Hidden dimension shared with the encoder. |
| Attention heads | $8$ | Number of attention heads. |
| *Structure Decoder* | | |
| Backbone | Meet | Full-atom coordinate decoder with specified bonds. |
| Depth | $6$ | Number of decoder layers. |
| Hidden size | $128$ | Hidden dimension shared with the encoder. |
| Attention heads | $8$ | Number of attention heads. |
| Coordinate prior std. | $1.0$ | Standard deviation of the coordinate prior. |
| Decode steps | $10$ | Number of coordinate-flow decoding iterations. |
| *Training Regime* | | |
| Training steps | $100{,}000$ | Number of optimization steps. |
| Batch size | $1024$ | Training batch size. |
| Learning rate | $1.0\times 10^{-3}$ | Initial learning rate with cosine decay. |
| Optimizer | AdamW | Weight decay $1.0\times 10^{-5}$. |
| $\lambda_{H}$, $\lambda_{X}$ | $0.6$, $0.8$ | Weights for latent KL terms. |
| $\lambda_{\mathrm{seq}}$ | $1.0$ | Weight for block-type cross entropy. |
| $\lambda_{\mathrm{coord}}^{\mathrm{lig}}$ | $1.0$ | Weight for peptide-atom vector-field matching. |
| $\lambda_{\mathrm{coord}}^{\mathrm{poc}}$ | $1.0$ | Weight for pocket-atom vector-field matching. |
| $\lambda_{\mathrm{dist}}$ | $0.5$ | Weight for preserving local geometry. |

Table G.2: Key hyperparameters of the latent diffusion models.

For all LDM variants, the VAE is frozen during LDM training and the learning rate follows cosine decay. In our runs, we found that larger LDM backbones required smaller learning rates for stable and effective optimization, with XS, S, B, and L using $1.0\times 10^{-3}$, $5.0\times 10^{-4}$, $3.0\times 10^{-4}$, and $1.0\times 10^{-4}$, respectively.