AgForce Enables Antigen-conditioned Generative Antibody Design

arXiv cs.LG Papers

Summary

This paper identifies three failure modes in existing antibody design methods (antigen blindness, vocabulary collapse, convergence to marginal distribution) and proposes AgForce, a novel encoder-decoder architecture using graph neural networks and mixture density networks, achieving state-of-the-art binding quality and sequence recovery on the Chimera-Bench benchmark.

arXiv:2605.21610v1 Announce Type: new Abstract: Antibody design methods condition on antigen structure to generate complementarity-determining regions (CDR), yet a systematic evaluation of baseline methods reveals that they largely ignore the antigen input. We identify three failure modes that explain this behavior. Antigen blindness arises because models derive predictions from antibody framework context rather than antigen information, producing nearly identical CDRs regardless of the target. Vocabulary collapse reduces predicted amino acids to three to five per position, far below the ground truth distribution in native sequences. Moreover, any model trained with standard per-position cross-entropy converges to the positional marginal distribution, making it provably unable to produce antigen-specific sequence predictions. We propose a novel encoder-decoder architecture called AgForce, that uses a graph neural network (GNN) as the encoder and specialized decoders for sequence-structure co-design. Specifically, we apply framework dropout, gated bottlenecks, and hyperbolic cross attention that prevent the antibody shortcut path. In the decoder, a Mixture Density Network (MDN) sequence head with Potts-like pairwise coupling and annealed Multiple Choice Learning (aMCL) replaces the cross-entropy objective with a multi-component distribution whose optimal solution differs from the positional marginal. An antigen cycle consistency head routes gradients through the sequence decoder, forcing predicted distributions to encode antigen identity. AgForce achieves the best binding quality and sequence recovery simultaneously on the CHIMERA-Bench dataset, improving amino acid recovery by 8% over the strongest sequence baseline while surpassing the baselines across all interface metrics, and nearly doubling the effective vocabulary of GNN methods. The source code is available at: https://github.com/mansoor181/ag-force.git

Cached at: 05/22/26, 08:50 AM

# AgForce Enables Antigen-conditioned Generative Antibody Design
Source: [https://arxiv.org/html/2605.21610](https://arxiv.org/html/2605.21610)
Mansoor Ahmed¹,², Murray Patterson¹,\* (¹Georgia State University, Atlanta, GA, USA; ²Georgia Institute of Technology, Atlanta, GA, USA)

###### Abstract

Antibody design methods condition on antigen structure to generate complementarity-determining regions (CDRs), yet a systematic evaluation of baseline methods reveals that they largely ignore the antigen input. We identify three failure modes that explain this behavior. Antigen blindness arises because models derive predictions from antibody framework context rather than antigen information, producing nearly identical CDRs regardless of the target. Vocabulary collapse reduces predicted amino acids to three to five per position, far below the ground truth distribution in native sequences. Moreover, any model trained with standard per-position cross-entropy converges to the positional marginal distribution, making it provably unable to produce antigen-specific sequence predictions. We propose a novel encoder-decoder architecture called AgForce that uses a graph neural network (GNN) as the encoder and specialized decoders for sequence-structure co-design. Specifically, we apply framework dropout, gated bottlenecks, and hyperbolic cross-attention that prevent the antibody shortcut path. In the decoder, a Mixture Density Network (MDN) sequence head with Potts-like pairwise coupling and annealed Multiple Choice Learning (aMCL) replaces the cross-entropy objective with a multi-component distribution whose optimal solution differs from the positional marginal. An antigen cycle consistency head routes gradients through the sequence decoder, forcing predicted distributions to encode antigen identity. AgForce achieves the best binding quality and sequence recovery simultaneously on the Chimera-Bench benchmark, improving amino acid recovery by 8% over the strongest sequence baseline while surpassing the baselines across all interface metrics, and nearly doubling the effective vocabulary of GNN methods. The source code is available at: [https://github.com/mansoor181/ag-force.git](https://github.com/mansoor181/ag-force.git)

## 1 Introduction

Antibodies recognize and neutralize foreign antigens through their complementarity-determining regions (CDRs), six hypervariable loops that form the primary binding interface (Potocnakova et al., [2016](https://arxiv.org/html/2605.21610#bib.bib141)). Among these, CDR-H3 exhibits the greatest sequence and structural diversity and contributes most to antigen specificity (Chothia and Lesk, [1987](https://arxiv.org/html/2605.21610#bib.bib276)). Designing CDR sequences and backbone structures that bind a target epitope is a central challenge in therapeutic antibody engineering (Hummer et al., [2022](https://arxiv.org/html/2605.21610#bib.bib57)), and a growing body of deep generative models now addresses this problem by conditioning on antigen structure (Luo et al., [2022](https://arxiv.org/html/2605.21610#bib.bib250); Kong et al., [2023a](https://arxiv.org/html/2605.21610#bib.bib254), [b](https://arxiv.org/html/2605.21610#bib.bib249); Verma et al., [2023](https://arxiv.org/html/2605.21610#bib.bib129); Wu et al., [2025b](https://arxiv.org/html/2605.21610#bib.bib199); Abir et al., [2025](https://arxiv.org/html/2605.21610#bib.bib257); Tan et al., [2025](https://arxiv.org/html/2605.21610#bib.bib252)). A natural expectation is that conditioning on the antigen should produce designs tailored to the target. Recent evidence suggests otherwise: unigram frequency alone explains most predictions (Kong et al., [2023b](https://arxiv.org/html/2605.21610#bib.bib249)), BLOSUM substitution scores predict model outputs as well as learned likelihoods (Uçar and Sormanni, [2025](https://arxiv.org/html/2605.21610#bib.bib282)), removing the antigen chain leaves predictions nearly unchanged (Li et al., [2025](https://arxiv.org/html/2605.21610#bib.bib332)), and including the antigen sequence does not improve design quality (Kim et al., [2024](https://arxiv.org/html/2605.21610#bib.bib333)).
These findings point to a systematic failure of existing conditioning mechanisms, but no prior work has identified the root causes or proposed principled remedies\.

We diagnose three causally linked failure modes that explain why the strongest CDR design paradigm, equivariant graph neural networks with greedy decoding, fails to condition on antigen information. Antigen blindness: all methods we evaluate show substantially lower recovery at antigen-contacting positions than at non-contacting positions, and an antigen-free baseline achieves the strongest binding quality, confirming that existing conditioning mechanisms contribute little. Vocabulary collapse: GNN methods with greedy decoding produce effective vocabularies of three to five amino acids per position, far below native diversity, with rare but biochemically important residues almost never predicted. The cross-entropy ceiling provides a unifying explanation: the standard per-position cross-entropy loss has its optimum at the positional marginal distribution regardless of conditioning signal, forcing the model to ignore the antigen and concentrate on a few amino acids per position.

![Refer to caption](https://arxiv.org/html/2605.21610v1/figures/h3_positional_trio.png)

![Refer to caption](https://arxiv.org/html/2605.21610v1/figures/veff_vs_aar.png)

Figure 1: (a) Vocabulary collapse: heatmap of the ground truth amino acid distribution of CDR-H3, the predictions by dyMEAN, and the predictions by our proposed model, AgForce. (b) AAR vs. effective vocabulary.

To address these failure modes, we introduce a novel encoder-decoder architecture called AgForce that uses an E(3)-equivariant graph neural network (EGNN) as the encoder and specialized decoders for sequence-structure co-design. For the cross-entropy ceiling, the decoder replaces the standard sequence head with a mixture density network with Potts-like pairwise coupling, trained by annealed multiple choice learning, which provably breaks the ceiling by enabling component specialization. For antigen blindness, a framework dropout module corrupts the antibody context during training, and an antigen consistency head routes gradients through the sequence decoder to enforce antigen conditioning. For vocabulary collapse, the multi-component MDN permits diverse predictions across components, and GDPP spectral regularization penalizes deviations from native diversity. Our contributions:

1. A diagnostic framework identifying three causally linked failure modes in equivariant GNN methods for CDR design, including a proof that per-position cross-entropy cannot produce antigen-specific predictions, verified empirically across baselines spanning five architectural families.
2. AgForce, which introduces targeted interventions for each diagnosed failure: an MDN-Potts sequence head with aMCL training, framework dropout, an antigen consistency loss, and virtual-node-augmented equivariant message passing with hyperbolic cross-attention. On the Chimera-Bench benchmark, AgForce achieves the best binding quality and sequence recovery simultaneously, improving amino acid recovery by 8% over the strongest sequence baseline while surpassing the strongest binding baseline across all interface metrics, and nearly doubling the effective vocabulary of GNN methods.

## 2 Related Work

#### Equivariant GNN methods\.

MEAN (Kong et al., [2023a](https://arxiv.org/html/2605.21610#bib.bib254)) formulates CDR design as E(3)-equivariant graph translation with multi-channel attention. dyMEAN (Kong et al., [2023b](https://arxiv.org/html/2605.21610#bib.bib249)) extends this to full-atom design with a shadow paratope loss. RAAD (Wu et al., [2025b](https://arxiv.org/html/2605.21610#bib.bib199)) adds relation-aware edge features and a contrastive specificity loss. These methods condition on antigen through spatial message passing and achieve the highest sequence recovery among all paradigms. RefineGNN (Jin et al., [2022b](https://arxiv.org/html/2605.21610#bib.bib131)) generates CDRs autoregressively without any antigen input, yet achieves the best binding metrics among all baselines. Multiple independent studies corroborate this conditioning failure: BLOSUM substitution scores predict model outputs as accurately as learned likelihoods (Uçar and Sormanni, [2025](https://arxiv.org/html/2605.21610#bib.bib282)), removing the antigen chain leaves predictions nearly unchanged (Li et al., [2025](https://arxiv.org/html/2605.21610#bib.bib332)), and a BLOSUM lookup table outperforms deep learning methods in binder enrichment (Chinery et al., [2024](https://arxiv.org/html/2605.21610#bib.bib334)). While these studies document the problem, they fall short of identifying the root causes and proposing methodological interventions.

#### Diffusion, flow, and ODE methods\.

DiffAb (Luo et al., [2022](https://arxiv.org/html/2605.21610#bib.bib250)) models CDR generation as diffusion on SE(3) $\times$ categorical, while AbFlowNet (Abir et al., [2025](https://arxiv.org/html/2605.21610#bib.bib257)) replaces diffusion with GFlowNet trajectory balance. AbMEGD (Chen et al., [2025](https://arxiv.org/html/2605.21610#bib.bib259)) adds multi-scale encoding, dyAb (Tan et al., [2025](https://arxiv.org/html/2605.21610#bib.bib252)) applies flow matching, AbODE (Verma et al., [2023](https://arxiv.org/html/2605.21610#bib.bib129)) uses conjoined ODEs, and RADAb (Wang et al., [2024](https://arxiv.org/html/2605.21610#bib.bib123)) augments diffusion with retrieval. FlowDesign (Wu et al., [2025a](https://arxiv.org/html/2605.21610#bib.bib301)) follows a diagnose-then-fix approach similar to ours. These sampling-based methods maintain higher amino acid diversity but achieve substantially lower sequence recovery than GNN methods.

#### Multi\-modal sequence prediction and conditioning\.

TERMinator (Li et al., [2023](https://arxiv.org/html/2605.21610#bib.bib330)) derives Potts energy tables from a GNN encoder for MCMC-based sequence design. PottsMPNN (Birnbaum and Keating, [2026](https://arxiv.org/html/2605.21610#bib.bib331)) adds pairwise Potts supervision to ProteinMPNN, improving thermodynamic stability beyond what native sequence recovery captures. Mixture density networks (Bishop, [1994](https://arxiv.org/html/2605.21610#bib.bib329)) model multi-modal distributions but have not been applied to discrete amino acid prediction. Annealed multiple choice learning (Perera et al., [2024](https://arxiv.org/html/2605.21610#bib.bib325)) addresses winner-take-all collapse in hypothesis ensembles but has not been applied to biological sequences. Our MDN-Potts head combines these: mixture components with learned coupling matrices are decoded by belief propagation and trained via aMCL end-to-end. For conditioning, classifier-free guidance (Ho and Salimans, [2022](https://arxiv.org/html/2605.21610#bib.bib335)) drops the conditioning signal during training to enforce genuine conditional distributions, but has not been used for antigen specificity. RAAD’s contrastive loss (Wu et al., [2025b](https://arxiv.org/html/2605.21610#bib.bib199)) operates on embeddings rather than decoded sequences. Our antigen classification loss operates directly on predicted sequence distributions, routing gradients through the decoder.

## 3 Preliminaries

### 3.1 Task Definition

We adopt the formulation from Chimera-Bench (Ahmed et al., [2026](https://arxiv.org/html/2605.21610#bib.bib1)). Given an antigen structure $A=\{(s_j,\mathbf{x}_j)\mid j\in V_A\}$, an epitope specification $E\subseteq V_A$, and an antibody framework $F=\{(s_i,\mathbf{x}_i)\mid i\in V_{\text{FR}}\}$, the task is to design CDR residues $R=\{(s_k,\mathbf{x}_k)\mid k\in V_{\text{CDR}}\}$ that maximize the conditional likelihood while satisfying epitope contact constraints:

$$R^{*}=\operatorname*{arg\,max}_{R}\;p_{\theta}\bigl(R\mid A,E,F\bigr),\quad\text{s.t.}\;\;\mathcal{C}(R,A)\subseteq E,\;\;\mathcal{C}(R,A)\neq\emptyset \tag{1}$$

where each residue is represented by its amino acid type $s_k\in\{1,\ldots,20\}$ and C$\alpha$ coordinate $\mathbf{x}_k\in\mathbb{R}^{3}$, and $\mathcal{C}(R,A)=\{j\in V_A\mid\exists\,k\in V_{\text{CDR}}:\|\mathbf{x}_k-\mathbf{x}_j\|<d_c\}$ is the set of antigen residues contacted by the designed CDRs within cutoff $d_c$. We focus on CDR-H3 as the most variable loop and primary determinant of antigen specificity (Chothia and Lesk, [1987](https://arxiv.org/html/2605.21610#bib.bib276)).
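The contact constraint in Eq. (1) can be checked directly from coordinates. The following is a minimal sketch, assuming NumPy arrays of Cα coordinates; the function names, the toy coordinates, and the 8.0 cutoff are illustrative assumptions, since the value of $d_c$ is not stated in this section.

```python
import numpy as np

def contact_set(cdr_xyz, antigen_xyz, cutoff):
    """Indices j of antigen residues within `cutoff` of any CDR residue."""
    # Pairwise distances, shape (num_cdr, num_antigen).
    diff = cdr_xyz[:, None, :] - antigen_xyz[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    return set(np.flatnonzero((dist < cutoff).any(axis=0)).tolist())

def satisfies_constraint(cdr_xyz, antigen_xyz, epitope, cutoff):
    """True when the contact set is non-empty and lies inside the epitope."""
    contacts = contact_set(cdr_xyz, antigen_xyz, cutoff)
    return len(contacts) > 0 and contacts <= set(epitope)

# Hypothetical toy coordinates: two CDR residues, three antigen residues.
cdr = np.array([[0.0, 0.0, 0.0], [4.0, 0.0, 0.0]])
antigen = np.array([[0.0, 5.0, 0.0], [4.0, 6.0, 0.0], [30.0, 0.0, 0.0]])
print(contact_set(cdr, antigen, cutoff=8.0))  # {0, 1}
```

In this toy case an epitope of `[0, 1]` satisfies the constraint, while an epitope of `[0]` does not, because antigen residue 1 is contacted but lies outside it.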

### 3.2 Notation and Graph Construction

The antibody-antigen complex is represented as a heterogeneous graph $\mathcal{G}=(V,\mathcal{E})$ where $V=V_{\text{HC}}\cup V_{\text{LC}}\cup V_A\cup V_{\text{glob}}\cup V_{\text{vn}}$ contains residue nodes from the heavy chain, light chain, and antigen, three global delimiter tokens, and $N_{\text{vn}}=3$ virtual nodes (Sestak et al., [2026](https://arxiv.org/html/2605.21610#bib.bib220)). Each residue node $i$ carries an amino acid type $s_i\in\{1,\ldots,20\}$ and four backbone atom coordinates $\mathbf{X}_i\in\mathbb{R}^{4\times 3}$. The edge set $\mathcal{E}$ is partitioned into 10 typed subsets covering intra-chain (radial, sequential, KNN), inter-chain (radial, KNN), global-to-chain, and virtual-node-to-epitope/CDR connectivity (Table [3](https://arxiv.org/html/2605.21610#A1.T3) in Appendix [A.2](https://arxiv.org/html/2605.21610#A1.SS2)). Each edge carries a 104-dimensional feature vector encoding edge type, relative position, pairwise distance RBFs, quaternion orientation, and local frame directions.

### 3.3 Failure Modes

We evaluate eleven CDR-H3 design methods on the Chimera-Bench benchmark (epitope-group split, 2338/292/292 train/val/test). All methods are retrained on Chimera-Bench using the authors’ released code with default hyperparameters. We organize the diagnostics around three failure modes that affect the strongest paradigm for CDR design, equivariant GNN methods with greedy decoding.

#### Antigen Blindness

If a model conditions on the antigen, it should produce distinct sequences for distinct antigens. We measure this via unique sequence fraction and per-position entropy ratio relative to ground truth diversity. dyMEAN produces only 20.9% unique sequences (entropy ratio 0.40), meaning the same CDR-H3 appears for 79% of antigens. RAAD and MEAN produce 76.4% and 66.8% unique sequences with entropy ratios of 0.48 and 0.45. All GNN methods overrepresent glycine and tyrosine by 2.5–2.9$\times$. Sampling-based methods (DiffAb, RefineGNN) achieve near-native diversity but at the cost of much lower sequence recovery (AAR 0.20–0.23). These patterns indicate that GNN methods converge toward a small set of average CDR sequences largely independent of the presented antigen.

#### Vocabulary Collapse

We measure predicted amino acid diversity via the effective vocabulary $\text{EV}=\exp\bigl(-\sum_{a}p(a)\log p(a)\bigr)$. Native CDR-H3 sequences exhibit EV $\approx 15.5$. GNN methods with greedy decoding collapse to EV 3.0–5.5: RAAD overrepresents glycine by 21pp and tyrosine by 18pp, while rare interface-critical residues (tryptophan, cysteine, methionine) are predicted at near-zero frequency. Motif diversity is correspondingly impoverished (13–52 unique bigrams versus 364 in native sequences). Sampling-based methods approach native diversity (EV 11.7–14.9) but at the cost of accuracy (AAR 0.20–0.23).
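The effective vocabulary is the exponential of the Shannon entropy of the amino acid frequencies, so it equals the number of residue types when they are used uniformly. A minimal sketch follows, assuming frequencies are pooled over a collection of residues given as one-letter codes; how the paper pools positions and sequences is not specified here, and the toy strings are hypothetical.

```python
import math
from collections import Counter

def effective_vocabulary(residues):
    """exp of the Shannon entropy (in nats) of the amino acid frequencies."""
    counts = Counter(residues)
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return math.exp(entropy)

uniform = "ACDEFGHIKLMNPQRSTVWY"   # each of the 20 amino acids once
collapsed = "GGGGYYYYAAAA"         # three residue types, equally frequent

ev_uniform = effective_vocabulary(uniform)      # close to 20
ev_collapsed = effective_vocabulary(collapsed)  # close to 3
print(round(ev_uniform, 3), round(ev_collapsed, 3))
```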

#### Cross\-Entropy Ceiling

Standard CDR design methods minimize per-position cross-entropy $\mathcal{L}_{\text{CE}}=-\sum_{i}\log p_{\theta}(y_i\mid\mathbf{x})$. We prove that the optimal predictor under this objective is the positional marginal, independent of any conditioning signal.

###### Proposition 1 (Cross-Entropy Ceiling).

For any model $p_{\theta}(s_i\mid A,\mathrm{context})$ trained to minimize per-position cross-entropy, the optimal solution satisfies $p_{\theta}^{*}(a\mid i)=\bar{p}_i(a)$, the empirical positional marginal, regardless of the conditioning context.

The proof (Appendix [A.1](https://arxiv.org/html/2605.21610#A1.SS1)) follows from the KL decomposition of cross-entropy: $\mathbb{E}[-\log p_{\theta}(s_i)]=H(\bar{p}_i)+D_{\mathrm{KL}}(\bar{p}_i\,\|\,p_{\theta})$, which is minimized when $p_{\theta}=\bar{p}_i$.
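The decomposition used in the proof can be verified numerically for a single position. The sketch below assumes a hypothetical three-letter alphabet and a predictor that outputs one distribution for the position; the values of `p_bar` and `q` are made up for illustration.

```python
import math

def cross_entropy(p, q):
    """Expected negative log-likelihood of q under samples drawn from p."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p_bar = [0.5, 0.3, 0.2]   # hypothetical positional marginal
q = [0.6, 0.2, 0.2]       # some other predicted distribution

# E[-log q] = H(p_bar) + KL(p_bar || q), up to floating point error.
gap = cross_entropy(p_bar, q) - (entropy(p_bar) + kl(p_bar, q))
print(abs(gap) < 1e-12)                                      # True
print(cross_entropy(p_bar, p_bar) < cross_entropy(p_bar, q))  # True
```

Because the KL term is non-negative and vanishes only at `q == p_bar`, any predictor restricted to one distribution per position cannot do better than the marginal.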

###### Corollary 2 (Vocabulary Collapse under Greedy Decoding).

At the CE optimum with argmax decoding, the effective vocabulary is bounded by the number of distinct positional modes, which for CDR-H3 concentrates on three to five amino acids.

We verify this empirically: RAAD, MEAN, and dyMEAN all converge to AAR 0.35–0.37, matching a zero-parameter positional mode lookup table. Their substitution patterns correlate at $r=0.69$–$0.72$ with the positional frequency table (PWM), confirming that predictions track positional amino acid frequencies rather than antigen-specific preferences.

## 4 Method

Figure [2](https://arxiv.org/html/2605.21610#S4.F2) shows the overall pipeline. The encoder corrupts heavy-chain framework embeddings via dropout, encodes the complex with a VirtualNode-EGNN, and computes CDR-to-epitope attention on the Lorentz hyperboloid. The decoder is an MDN-Potts head trained via annealed multiple choice learning. We detail each component below; training and inference algorithms are in Appendix [A.3](https://arxiv.org/html/2605.21610#A1.SS3).

![[Uncaptioned image]](https://arxiv.org/html/2605.21610v1/figures/ag_force_overall.drawio.png)

Figure 2: Overall pipeline of AgForce.

![Refer to caption](https://arxiv.org/html/2605.21610v1/figures/ag_force_encoder_decoder.drawio.png)

Figure 3: AgForce architecture. Encoder: Framework dropout corrupts heavy-chain framework embeddings before the VirtualNode-EGNN encodes the complex. Hyperbolic cross-attention combines CDR and epitope embeddings with projected ESM-2 features. Decoder: The MDN-Potts head predicts amino acid distributions through $K=4$ mixture components with pairwise coupling, trained via aMCL. Antigen classification loss routes gradients through the predicted CDR probabilities back to the antigen embedding space.

### 4.1 Feature Encoding and Framework Dropout

Each residue in the antibody-antigen complex is represented by a 108-dimensional feature vector encoding backbone geometry (sinusoidal position embeddings, bond distance RBFs, dihedral angles, local frame directions), amino acid identity (masked to zero for CDR positions during training), interface complementarity features, and a segment type embedding (full breakdown in Appendix [A.2](https://arxiv.org/html/2605.21610#A1.SS2)). A dual-path MLP processes geometric and chemical features through separate pathways with SiLU activations, fuses them, and projects to embedding dimension $d=128$. Epitope residues receive an additional learnable embedding.

To address antigen blindness (Section [3.3](https://arxiv.org/html/2605.21610#S3.SS3.SSS0.Px1)), we apply framework dropout before the GNN. During training, each heavy-chain framework residue embedding is independently set to zero with probability $p=0.3$:

$$\mathbf{h}_i^{\text{train}}=\begin{cases}\mathbf{0}&\text{if }i\in V_{\text{FR}}^{\text{HC}}\text{ and }m_i=1\\ \mathbf{h}_i&\text{otherwise}\end{cases},\quad m_i\sim\text{Bernoulli}(p) \tag{2}$$

Because the corruption is applied *before* message passing, its effect propagates through all subsequent GNN layers, forcing the model to extract information from the antigen pathway rather than relying on the antibody framework shortcut.
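Eq. (2) amounts to zeroing whole residue embeddings rather than individual features. A minimal NumPy sketch follows; the function name, the boolean mask argument, and the toy shapes are assumptions for illustration, and no rescaling of surviving embeddings is applied because Eq. (2) does not include one.

```python
import numpy as np

def framework_dropout(h, is_hc_framework, p=0.3, rng=None, training=True):
    """Zero whole embeddings of heavy-chain framework residues with prob. p."""
    if not training:
        return h                                   # identity at inference time
    rng = np.random.default_rng() if rng is None else rng
    # m_i = 1 with probability p, restricted to heavy-chain framework residues.
    drop = (rng.random(h.shape[0]) < p) & is_hc_framework
    out = h.copy()
    out[drop] = 0.0
    return out

h = np.ones((6, 4))                                # 6 residues, toy width 4
mask = np.array([True, True, True, True, False, False])
h_train = framework_dropout(h, mask, p=0.3, rng=np.random.default_rng(0))
print(h_train)
```

Rows outside the mask (here the last two) are never modified, so CDR and antigen embeddings pass through untouched.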

### 4.2 VirtualNode-EGNN

Operating on the graph $\mathcal{G}$ (Section [3.2](https://arxiv.org/html/2605.21610#S3.SS2)), $N_{\text{vn}}=3$ virtual nodes with learnable features and coordinates create a two-hop information channel between epitope and CDR residues, addressing the over-squashing problem (Alon and Yahav, [2021](https://arxiv.org/html/2605.21610#bib.bib326)), where information from distant epitope residues dilutes through sequential message passing.

The GNN consists of 5 relation-aware EGNN layers (Satorras et al., [2021](https://arxiv.org/html/2605.21610#bib.bib179); Wu et al., [2025b](https://arxiv.org/html/2605.21610#bib.bib199)). Each layer updates node features and coordinates simultaneously. For edge $(i,j)$ of type $t$, the message is:

$$\mathbf{m}_{ij}=\text{MLP}_{\text{msg}}\left([\mathbf{h}_i,\;\mathbf{h}_j,\;\text{vec}(\Delta\mathbf{x}_i\Delta\mathbf{x}_j^{\top}),\;\mathbf{e}_{ij}]\right) \tag{3}$$

where $\Delta\mathbf{x}_i=\mathbf{x}_i-\mathbf{x}_j$ and $\text{vec}(\cdot)$ flattens the outer product of coordinate differences into a radial feature. The Gram matrix entries are dot products of displacement vectors, which are E(3)-invariant. Node features are updated via per-type linear aggregation:

$$\mathbf{h}_i^{\prime}=\mathbf{h}_i+\text{MLP}_{\text{node}}\left(\left[\mathbf{h}_i,\;\sum_{t=0}^{9}\mathbf{W}_t\sum_{j\in\mathcal{N}_t(i)}\mathbf{m}_{ij}\right]\right) \tag{4}$$

where $\mathbf{W}_t$ is a type-specific projection matrix and $\mathcal{N}_t(i)$ is the set of neighbors under edge type $t$. Coordinates are updated equivariantly:

$$\mathbf{x}_i^{\prime}=\mathbf{x}_i+\sum_{t}\text{mean}_{j\in\mathcal{N}_t(i)}\left(\Delta\mathbf{x}_{ij}\odot\text{MLP}_t^{\text{coord}}(\mathbf{m}_{ij})\right) \tag{5}$$

The $\Delta\mathbf{x}\odot\text{scalar}$ structure ensures E(3)-equivariance by construction. After 5 layers, the GNN produces residue embeddings $\mathbf{h}\in\mathbb{R}^{N\times 256}$ and updated coordinates $\mathbf{Z}\in\mathbb{R}^{N\times 4\times 3}$.

###### Proposition 3 (E(3)-Equivariance of AgForce).

Coordinate predictions are E(3)-equivariant and sequence predictions are E(3)-invariant (proof in Appendix [A.1](https://arxiv.org/html/2605.21610#A1.SS1)).
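The equivariance of a displacement-times-scalar update can be checked numerically. The sketch below is a deliberately simplified stand-in for Eq. (5): one coordinate channel per node, a single fully connected edge type, and a fixed `tanh` of the squared distance in place of the learned coordinate MLP. All of these simplifications are assumptions made for illustration.

```python
import numpy as np

def coord_update(x):
    """x_i' = x_i + mean_j (x_i - x_j) * phi(|x_i - x_j|^2), fully connected."""
    diff = x[:, None, :] - x[None, :, :]             # (N, N, 3) displacements
    sq = (diff ** 2).sum(-1, keepdims=True)          # invariant squared distances
    scale = np.tanh(0.1 * sq)                        # stand-in for the coordinate MLP
    n = x.shape[0]
    return x + (diff * scale).sum(axis=1) / (n - 1)  # the j = i term is zero

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))
q, _ = np.linalg.qr(rng.normal(size=(3, 3)))         # random orthogonal matrix
t = rng.normal(size=3)                               # random translation

lhs = coord_update(x @ q.T + t)                      # transform, then update
rhs = coord_update(x) @ q.T + t                      # update, then transform
print(np.allclose(lhs, rhs))                         # True
```

The check passes because the scalar factor depends only on distances, which rotations, reflections, and translations leave unchanged, while the displacement itself rotates with the input.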

### 4.3 Hyperbolic Cross-Attention

Rather than standard Euclidean attention, AgForce performs CDR-to-epitope cross-attention on the Lorentz hyperboloid (Nickel and Kiela, [2017](https://arxiv.org/html/2605.21610#bib.bib327)). Hyperbolic spaces represent hierarchical relationships naturally because their volume grows exponentially with distance, matching the structure of antibody-antigen binding where global epitope geometry constrains local amino acid choices. Given CDR embeddings $\mathbf{h}_{\text{cdr}}\in\mathbb{R}^{L\times D}$ and epitope embeddings $\mathbf{h}_{\text{epi}}\in\mathbb{R}^{E\times D}$, queries and keys are projected into $H=4$ heads and mapped onto the hyperboloid by computing $x_0=\sqrt{1/c+\|\mathbf{x}_{1:}\|^{2}}$ with curvature $c=1.0$. Attention scores use pairwise hyperbolic distances:

$$d_{\mathcal{H}}(\mathbf{q},\mathbf{k})=\frac{1}{\sqrt{c}}\operatorname{arccosh}\bigl(-c\langle\mathbf{q},\mathbf{k}\rangle_{\mathcal{L}}\bigr) \tag{6}$$

$$\alpha_{ij}=\operatorname{softmax}_{j}\left(\frac{-d_{\mathcal{H}}(\mathbf{q}_i,\mathbf{k}_j)}{\sqrt{D/H}}\right) \tag{7}$$

where $\langle\cdot,\cdot\rangle_{\mathcal{L}}$ denotes the Minkowski inner product $\langle\mathbf{q},\mathbf{k}\rangle_{\mathcal{L}}=-q_0k_0+\sum_{d=1}^{D/H}q_dk_d$. The output is aggregated in Euclidean space via a weighted sum of value vectors.
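Eqs. (6) and (7) can be sketched for a single attention head as follows. This is an illustrative NumPy version, not the released implementation: the function names are hypothetical, the query/key/value projections are omitted, and the clip before `arccosh` is an added numerical safeguard.

```python
import numpy as np

def lift(x, c=1.0):
    """Prepend the time coordinate x0 = sqrt(1/c + |x|^2)."""
    x0 = np.sqrt(1.0 / c + (x ** 2).sum(-1, keepdims=True))
    return np.concatenate([x0, x], axis=-1)

def lorentz_distance(q, k, c=1.0):
    """Pairwise d_H between rows of q (L, d+1) and rows of k (E, d+1)."""
    inner = -np.outer(q[:, 0], k[:, 0]) + q[:, 1:] @ k[:, 1:].T  # Minkowski product
    return np.arccosh(np.clip(-c * inner, 1.0, None)) / np.sqrt(c)

def hyperbolic_attention(q_e, k_e, v, c=1.0):
    """Softmax over negative hyperbolic distances, Euclidean value aggregation."""
    d = lorentz_distance(lift(q_e, c), lift(k_e, c), c)
    logits = -d / np.sqrt(q_e.shape[-1])             # per-head width plays D/H
    logits -= logits.max(axis=-1, keepdims=True)     # stable softmax
    w = np.exp(logits)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v, w

rng = np.random.default_rng(0)
q_e, k_e, v = rng.normal(size=(3, 8)), rng.normal(size=(5, 8)), rng.normal(size=(5, 8))
out, w = hyperbolic_attention(q_e, k_e, v)
print(out.shape, w.sum(axis=1))
```

Nearby points on the hyperboloid receive larger weights, and a point's distance to itself is zero up to rounding, since $-c\langle\mathbf{q},\mathbf{q}\rangle_{\mathcal{L}}=1$ for any lifted point.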

#### Antigen gated bottleneck\.

The attention output $\mathbf{o}_i$ passes through a gated bottleneck before reaching the sequence head:

$$\mathbf{g}_i=\sigma(\mathbf{W}_g\mathbf{o}_i+\mathbf{b}_g),\quad\tilde{\mathbf{h}}_i=\alpha\cdot(\mathbf{h}_i^{\text{cdr}}\odot\mathbf{g}_i)+(1-\alpha)\cdot\mathbf{h}_i^{\text{cdr}} \tag{8}$$

where $\alpha=\sigma(\alpha_{\text{logit}})\cdot 2\alpha_0$ is a learnable mixing coefficient initialized near $\alpha_0=0.5$. The output is the concatenation $[\tilde{\mathbf{h}}_i,\;\mathbf{o}_i]\in\mathbb{R}^{512}$. When ESM-2 embeddings are available, they are projected from 1280 to 256 dimensions and concatenated, yielding a 768-dimensional input to the sequence head.
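The gate in Eq. (8) can be written in a few lines. The sketch below assumes 256-dimensional CDR embeddings and attention outputs, which is consistent with the stated 512-dimensional concatenation; the function name and the random toy weights are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_bottleneck(h_cdr, o, W_g, b_g, alpha_logit=0.0, alpha0=0.5):
    """Eq. (8): gate h_cdr by the attention output, then concatenate with it."""
    g = sigmoid(o @ W_g.T + b_g)                     # gate computed from o_i only
    alpha = sigmoid(alpha_logit) * 2.0 * alpha0      # equals alpha0 when logit is 0
    h_tilde = alpha * (h_cdr * g) + (1.0 - alpha) * h_cdr
    return np.concatenate([h_tilde, o], axis=-1)

rng = np.random.default_rng(0)
L, D = 7, 256
h_cdr, o = rng.normal(size=(L, D)), rng.normal(size=(L, D))
W_g, b_g = rng.normal(size=(D, D)) * 0.01, np.zeros(D)
out = gated_bottleneck(h_cdr, o, W_g, b_g)
print(out.shape)  # (7, 512)
```

Since the gate depends only on the attention output, the antigen-derived signal controls how much of each CDR feature reaches the sequence head; driving `alpha_logit` strongly negative recovers the ungated embedding.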

### 4.4 Mixture Density Network with Pairwise Coupling

To break the cross-entropy ceiling (Section [3.3](https://arxiv.org/html/2605.21610#S3.SS3.SSS0.Px3)), AgForce replaces the linear sequence head with a mixture density network. A shared layer applies layer normalization, a linear projection from 768 to 384 dimensions, SiLU activation, and dropout. From this, $K=4$ component heads each produce logits $\boldsymbol{\ell}_k\in\mathbb{R}^{L\times 20}$, and a mixing head produces per-position weights $\boldsymbol{\pi}\in\mathbb{R}^{L\times K}$ via softmax.

Each component $k$ is equipped with a symmetric coupling matrix $\mathbf{J}_k\in\mathbb{R}^{20\times 20}$ (Marks et al., [2011](https://arxiv.org/html/2605.21610#bib.bib328)) that captures pairwise amino acid preferences between adjacent positions. The component logits are refined through two rounds of belief propagation:

$$\mathbf{m}_{i\to i+1}=\boldsymbol{b}_i\cdot\frac{\mathbf{J}_k+\mathbf{J}_k^{\top}}{2},\quad\mathbf{m}_{i+1\to i}=\boldsymbol{b}_{i+1}\cdot\frac{\mathbf{J}_k+\mathbf{J}_k^{\top}}{2} \tag{9}$$

$$\boldsymbol{\ell}_k^{(i)}\leftarrow\boldsymbol{\ell}_k^{(i)}+g_i\cdot\bigl(\mathbf{m}_{i-1\to i}+\mathbf{m}_{i+1\to i}\bigr) \tag{10}$$

where $\boldsymbol{b}_i=\text{softmax}(\boldsymbol{\ell}_k^{(i)})$ are soft beliefs and $g_i=\sigma(\text{MLP}(\mathbf{h}_i))$ is a learned gate. After two rounds, the coupling matrices learn which amino acid pairs are compatible at adjacent CDR positions.
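The two rounds of message passing in Eqs. (9) and (10) can be sketched for one mixture component as follows. The function names and toy inputs are hypothetical, the gate is passed in as a fixed vector rather than computed by an MLP, and the treatment of the first and last positions (a single neighbor each) is an assumption.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def refine_logits(logits, J, gate, rounds=2):
    """logits: (L, 20); J: (20, 20); gate: (L,) with values in (0, 1)."""
    J_sym = 0.5 * (J + J.T)                 # symmetrized coupling matrix
    for _ in range(rounds):
        b = softmax(logits)                 # soft beliefs b_i
        msg = b @ J_sym                     # row i is the message sent by position i
        incoming = np.zeros_like(logits)
        incoming[1:] += msg[:-1]            # m_{i-1 -> i}
        incoming[:-1] += msg[1:]            # m_{i+1 -> i}
        logits = logits + gate[:, None] * incoming
    return logits

rng = np.random.default_rng(0)
L = 6
logits = rng.normal(size=(L, 20))
J = rng.normal(size=(20, 20)) * 0.1
gate = np.full(L, 0.5)
refined = refine_logits(logits, J, gate)
print(refined.shape)  # (6, 20)
```

With a zero coupling matrix or a closed gate the logits pass through unchanged, so the refinement only acts where the model has learned pairwise preferences and chooses to apply them.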

#### Annealed Multiple Choice Learning \(aMCL\)\.

Rather than standard mixture likelihood, we adopt aMCL (Perera et al., [2024](https://arxiv.org/html/2605.21610#bib.bib325)). The loss for each component combines cross-entropy and pairwise energy:

$$\ell_k=\mathcal{L}_{\text{CE}}^{(k)}+\lambda_{\text{pair}}\cdot E_{\text{pair}}^{(k)},\quad E_{\text{pair}}^{(k)}=\frac{1}{L-1}\sum_{i=1}^{L-1}\boldsymbol{b}_i^{\top}\frac{\mathbf{J}_k+\mathbf{J}_k^{\top}}{2}\boldsymbol{b}_{i+1} \tag{11}$$

Components are assigned soft Boltzmann weights based on their losses:

$$w_k=\frac{\exp(-\ell_k/\tau)}{\sum_{k^{\prime}}\exp(-\ell_{k^{\prime}}/\tau)},\quad\mathcal{L}_{\text{seq}}=\sum_{k=1}^{K}w_k\cdot\ell_k \tag{12}$$

Temperature $\tau$ anneals from $\tau_{\text{start}}=2.0$ to $\tau_{\text{end}}=0.1$ over 20 epochs. At high temperature, all components receive roughly equal gradient and explore. As $\tau$ decreases, the assignment sharpens toward winner-take-all, encouraging specialization.
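The effect of annealing on the Boltzmann weights in Eq. (12) is easy to see numerically. In the sketch below the per-component losses are hypothetical, the linear schedule is an assumption (the text gives only the endpoints and the 20-epoch duration), and whether the weights are detached from the gradient is not specified here.

```python
import math

def boltzmann_weights(losses, tau):
    """w_k proportional to exp(-loss_k / tau), shifted for numerical stability."""
    m = min(losses)
    e = [math.exp(-(l - m) / tau) for l in losses]
    z = sum(e)
    return [x / z for x in e]

def temperature(epoch, tau_start=2.0, tau_end=0.1, anneal_epochs=20):
    """Linear interpolation between the two endpoints (schedule shape assumed)."""
    frac = min(epoch / anneal_epochs, 1.0)
    return tau_start + frac * (tau_end - tau_start)

def amcl_loss(losses, tau):
    w = boltzmann_weights(losses, tau)
    return sum(wk * lk for wk, lk in zip(w, losses))

losses = [2.1, 1.8, 2.5, 2.0]                 # hypothetical losses for K = 4
w_early = boltzmann_weights(losses, temperature(0))
w_late = boltzmann_weights(losses, temperature(20))
print([round(w, 3) for w in w_early])
print([round(w, 3) for w in w_late])
```

Early in training the four weights are close to uniform, while at the final temperature most of the weight falls on the component with the lowest loss, which is the winner-take-all behavior described above.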

###### Proposition 4 (MDN-aMCL Breaks the Cross-Entropy Ceiling).

The aMCL objective admits optimal solutions where individual component predictions $p_k^{*}(a\mid i)$ differ from the positional marginal $\bar{p}_i(a)$. The Boltzmann assignment partitions training examples among components, and each component’s optimum is the conditional marginal over its assigned subset (proof in Appendix [A.1](https://arxiv.org/html/2605.21610#A1.SS1)).

#### GDPP diversity regularization\.

We regularize the mixture with a Generative Determinantal Point Process loss (Elfeki et al., [2019](https://arxiv.org/html/2605.21610#bib.bib319)) that matches eigenvalue spectra of kernel matrices from predicted and ground truth distributions:

$$
\mathcal{L}_{\text{GDPP}} = \left\|\text{eigenvalues}\!\left(\mathbf{K}_{\text{pred}}\right) - \text{eigenvalues}\!\left(\mathbf{K}_{\text{true}}\right)\right\|_{2}^{2} \tag{13}
$$

where $\mathbf{K}_{\text{pred}}=\mathbf{P}\mathbf{P}^{\top}+\epsilon\mathbf{I}$ with $\mathbf{P}$ the softmax probabilities from the active component. This directly counteracts vocabulary collapse (Section [3.3](https://arxiv.org/html/2605.21610#S3.SS3.SSS0.Px2)).
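A small NumPy sketch of Eq. (13) shows why a collapsed prediction is penalized. Several details are assumptions, since the section does not specify them: rows of $\mathbf{P}$ index CDR positions, $\mathbf{K}_{\text{true}}$ is built the same way from one-hot native residues, eigenvalues are compared in sorted order, and the value of $\epsilon$ is arbitrary.

```python
import numpy as np

def gdpp_loss(P_pred, P_true, eps=1e-6):
    """Squared L2 distance between kernel eigenvalue spectra.

    P_pred, P_true: (L, A) row-stochastic matrices (positions x amino acids).
    """
    n = P_pred.shape[0]
    K_pred = P_pred @ P_pred.T + eps * np.eye(n)
    K_true = P_true @ P_true.T + eps * np.eye(n)
    # eigvalsh returns eigenvalues of a symmetric matrix in ascending order.
    ev_pred = np.linalg.eigvalsh(K_pred)
    ev_true = np.linalg.eigvalsh(K_true)
    return float(np.sum((ev_pred - ev_true) ** 2))

# Hypothetical 4-position example over a 20-letter alphabet.
P_true = np.eye(20)[[0, 5, 9, 17]]     # four different native residues
P_collapsed = np.eye(20)[[0, 0, 0, 0]] # same residue at every position

loss_same = gdpp_loss(P_true, P_true)
loss_collapsed = gdpp_loss(P_collapsed, P_true)
```

In this toy case the native kernel has four eigenvalues near 1, whereas the collapsed kernel has one eigenvalue near 4 and three near 0, so the spectral mismatch is large.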

### 4.5 Antigen Classification Loss

This loss forces the predicted CDR probability distributions to encode antigen-specific information. Given a batch of $B$ complexes, let $\mathbf{p}_{i}\in\mathbb{R}^{L_{i}\times 20}$ denote the softmax probabilities from the sequence head for complex $i$, and let $\mathbf{a}_{i}\in\mathbb{R}^{D}$ denote the mean-pooled antigen embedding from the GNN. The classification head first computes a differentiable “soft sequence” embedding:

$$
\mathbf{c}_{i} = \text{MLP}\!\left(\frac{1}{L_{i}}\sum_{j=1}^{L_{i}}\mathbf{p}_{i}[j]\cdot\mathbf{E}_{\text{AA}}\right) \tag{14}
$$

where $\mathbf{E}_{\text{AA}}\in\mathbb{R}^{20\times D}$ are learnable amino acid embeddings. The loss is an InfoNCE objective over the batch:

$$
\mathcal{L}_{\text{cls}} = -\frac{1}{B}\sum_{i=1}^{B}\log\frac{\exp(\mathbf{c}_{i}^{\top}\mathbf{a}_{i})}{\sum_{k=1}^{B}\exp(\mathbf{c}_{i}^{\top}\mathbf{a}_{k})} \tag{15}
$$

If the model ignores the antigen and predicts identical CDRs for all complexes, $\mathbf{c}_{i}\approx\mathbf{c}_{j}$ and classification collapses to $1/B$. The key property is that gradients flow through the softmax output of the sequence head and back through the entire decoder pathway. This distinguishes it from embedding-level contrastive losses where the gradient bypasses the sequence decoder entirely.
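The forward computation of Eqs. (14) and (15) can be sketched in NumPy. This is an illustration under assumptions: the MLP in Eq. (14) is replaced by the identity, all inputs are hypothetical toy arrays, and no gradient is computed, so the sketch shows only the loss values and not the gradient routing discussed above.

```python
import numpy as np

def soft_sequence_embedding(p, E_aa):
    # Argument of the MLP in Eq. (14): mean over positions of p[j] @ E_aa.
    # p: (L, 20) probabilities, E_aa: (20, D) amino acid embeddings.
    return (p @ E_aa).mean(axis=0)

def antigen_infonce(c, a):
    """InfoNCE of Eq. (15). c, a: (B, D); row i of c is paired with row i of a."""
    logits = c @ a.T                                   # logits[i, k] = c_i . a_k
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))

B = 4
a = 10.0 * np.eye(B)            # hypothetical, well-separated antigen embeddings
c_specific = 10.0 * np.eye(B)   # each c_i aligned with its own antigen
c_blind = np.ones((B, B))       # identical c_i for every complex

loss_specific = antigen_infonce(c_specific, a)
loss_blind = antigen_infonce(c_blind, a)

emb = soft_sequence_embedding(np.full((7, 20), 1 / 20), np.eye(20))
```

When every complex yields the same embedding, the softmax over antigens is uniform and the loss equals $\log B$, the value corresponding to the $1/B$ collapse described above.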

### 4.6 Training Objective

The full training objective combines five loss terms spanning structure prediction, sequence prediction, interface geometry, sequence diversity, and antigen forcing:

$$
\mathcal{L} = \mathcal{L}_{\text{seq}} + \alpha\mathcal{L}_{\text{coord}} + \delta\mathcal{L}_{\text{shadow}} + \epsilon\mathcal{L}_{\text{GDPP}} + \lambda_{\text{cls}}\mathcal{L}_{\text{cls}} \tag{16}
$$

The coordinate loss $\mathcal{L}_{\text{coord}}$ is a smooth-$\ell_{1}$ (Huber) loss on predicted versus true C$\alpha$ backbone coordinates for CDR positions:

$$
\mathcal{L}_{\text{coord}} = \sum_{k\in V_{\text{CDR}}}\text{smooth}_{\ell_{1}}\!\left(\mathbf{Z}_{k}-\mathbf{x}_{k}^{\text{true}}\right) \tag{17}
$$

The shadow paratope loss $\mathcal{L}_{\text{shadow}}$ (Kong et al., [2023b](https://arxiv.org/html/2605.21610#bib.bib249)) enforces that the predicted CDR backbone maintains correct distances to the epitope surface:

$$
\mathcal{L}_{\text{shadow}} = \frac{1}{|\mathcal{E}_{\text{epi}}|\cdot L}\sum_{j\in\mathcal{E}_{\text{epi}}}\sum_{k\in V_{\text{CDR}}}\left|\,\|\hat{\mathbf{x}}_{k}-\mathbf{x}_{j}\| - \|\mathbf{x}_{k}^{\text{true}}-\mathbf{x}_{j}\|\,\right| \tag{18}
$$

where $\mathcal{E}_{\text{epi}}$ is the set of epitope residues within 8 Å of any CDR residue. Loss weights are $\alpha=1.3$, $\delta=0.664$, $\epsilon=0.05$, and $\lambda_{\text{cls}}=0.2$, determined by hyperparameter sweep. At inference, greedy decoding selects the highest-weight mixture component per position and the argmax amino acid. Full training and inference algorithms are in Appendix [A.3](https://arxiv.org/html/2605.21610#A1.SS3).
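The geometric terms of Eqs. (16) to (18) can be sketched as below. Assumptions are flagged in the comments: the smooth-$\ell_{1}$ threshold `beta=1.0` and the summation over the three coordinate axes are not specified in the text, and all coordinates are hypothetical.

```python
import numpy as np

def smooth_l1(diff, beta=1.0):
    # Huber-style penalty; beta = 1.0 is an assumed threshold.
    ad = np.abs(diff)
    return np.where(ad < beta, 0.5 * ad ** 2 / beta, ad - 0.5 * beta)

def coord_loss(z_pred, x_true):
    # Eq. (17): summed over CDR residues (and, by assumption, over x, y, z).
    return float(smooth_l1(z_pred - x_true).sum())

def shadow_paratope_loss(x_pred, x_true, x_epi):
    """Eq. (18). x_pred, x_true: (L, 3) CDR coords; x_epi: (E, 3) epitope coords."""
    d_pred = np.linalg.norm(x_pred[None, :, :] - x_epi[:, None, :], axis=-1)  # (E, L)
    d_true = np.linalg.norm(x_true[None, :, :] - x_epi[:, None, :], axis=-1)
    return float(np.abs(d_pred - d_true).mean())   # mean = sum / (E * L)

def total_loss(l_seq, l_coord, l_shadow, l_gdpp, l_cls,
               alpha=1.3, delta=0.664, eps=0.05, lam_cls=0.2):
    # Eq. (16) with the loss weights reported in the text.
    return l_seq + alpha * l_coord + delta * l_shadow + eps * l_gdpp + lam_cls * l_cls

# Hypothetical coordinates in Angstrom: one epitope residue, two CDR residues.
x_epi = np.array([[0.0, 0.0, 0.0]])
x_true = np.array([[3.0, 0.0, 0.0], [0.0, 4.0, 0.0]])
x_pred = np.array([[6.0, 0.0, 0.0], [0.0, 4.0, 0.0]])

l_shadow = shadow_paratope_loss(x_pred, x_true, x_epi)
l_coord = coord_loss(x_pred, x_true)
```

The shadow term depends only on CDR-to-epitope distances, so it penalizes a displaced interface even in cases where the loop's internal geometry is preserved.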

## 5 Experiments

### 5.1 Setup

#### Dataset and metrics\.

We evaluate on Chimera-Bench (Ahmed et al., [2026](https://arxiv.org/html/2605.21610#bib.bib1)), comprising 2,922 antibody-antigen complexes. We report CDR-H3 results on the epitope-group split (2338/292/292 train/val/test), the most challenging setting. Sequence quality is measured by amino acid recovery (AAR), contact AAR (CAAR, restricted to positions within 8 Å of the antigen), and perplexity (PPL). Structure quality is measured by C$\alpha$ RMSD. Binding quality is measured by fnat, iRMSD, DockQ (Basu and Wallner, [2016](https://arxiv.org/html/2605.21610#bib.bib270)), and epitope F1. All interface metrics use symmetric C$\alpha$–C$\alpha$ contacts at 8 Å restricted to CDR residues.
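As a concrete reading of the two recovery metrics, the sketch below computes AAR and CAAR for one predicted loop. It assumes equal-length, position-aligned sequences and a precomputed per-residue minimum distance to the antigen; the inclusive `<=` cutoff and the function names are assumptions, and the benchmark's own evaluation code may differ.

```python
def aar(pred, true):
    # Fraction of positions where the predicted residue equals the native one.
    if len(pred) != len(true):
        raise ValueError("sequences must be aligned and of equal length")
    return sum(p == t for p, t in zip(pred, true)) / len(true)

def contact_aar(pred, true, min_dist_to_antigen, cutoff=8.0):
    # AAR restricted to positions within `cutoff` Angstrom of the antigen.
    idx = [i for i, d in enumerate(min_dist_to_antigen) if d <= cutoff]
    if not idx:
        return float("nan")   # no contact positions in this loop
    return sum(pred[i] == true[i] for i in idx) / len(idx)

# Hypothetical 5-residue loop and per-residue distances in Angstrom.
pred, true = "ARDYG", "ARGYW"
dists = [12.0, 7.5, 5.0, 9.0, 6.0]

overall = aar(pred, true)                 # A, R, Y match
contact = contact_aar(pred, true, dists)  # contact positions are 1, 2, 4
```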

#### Baselines\.

We compare against: equivariant GNNs (RAAD (Wu et al., [2025b](https://arxiv.org/html/2605.21610#bib.bib199)), MEAN (Kong et al., [2023a](https://arxiv.org/html/2605.21610#bib.bib254)), dyMEAN (Kong et al., [2023b](https://arxiv.org/html/2605.21610#bib.bib249))), diffusion/flow models (DiffAb (Luo et al., [2022](https://arxiv.org/html/2605.21610#bib.bib250)), AbFlowNet (Abir et al., [2025](https://arxiv.org/html/2605.21610#bib.bib257)), AbMEGD (Chen et al., [2025](https://arxiv.org/html/2605.21610#bib.bib259)), RADAb (Wang et al., [2024](https://arxiv.org/html/2605.21610#bib.bib123)), dyAb (Tan et al., [2025](https://arxiv.org/html/2605.21610#bib.bib252))), AbODE (Verma et al., [2023](https://arxiv.org/html/2605.21610#bib.bib129)), RefineGNN (Jin et al., [2022b](https://arxiv.org/html/2605.21610#bib.bib131)), and AbDockGen (Jin et al., [2022a](https://arxiv.org/html/2605.21610#bib.bib227)). All models are retrained on Chimera-Bench with their original hyperparameters.

### 5.2 Main Results

Table 1: CDR-H3 design on Chimera-Bench (epitope-group split, 292 test complexes). Best in bold, second-best underlined.

Table [1](https://arxiv.org/html/2605.21610#S5.T1) presents the main comparison. AgForce achieves state-of-the-art performance across all metric categories simultaneously, which is notable because prior methods typically trade off between sequence recovery and binding quality. The three GNN baselines (RAAD, MEAN, dyMEAN) all converge to AAR = 0.37, the positional marginal ceiling identified in Section [3.3](https://arxiv.org/html/2605.21610#S3.SS3.SSS0.Px3). AgForce reaches AAR = 0.40, exceeding this ceiling by 3 percentage points through the MDN-Potts head with aMCL training, while also achieving the lowest perplexity (2.95). On binding quality, AgForce achieves the highest fnat (0.67), lowest iRMSD (1.30 Å), best DockQ (0.74), and best epitope F1 (0.77), surpassing RefineGNN, which was previously the strongest binder despite having no antigen input. The lowest RMSD (1.60 Å) confirms that structural quality is maintained at the interface.

Contact AAR (CAAR = 0.21) remains comparable to baselines, with gains coming primarily from non-contact and anchor positions. Predicting the exact amino acid at contact positions requires resolving the many-to-one mapping between sequences and binding modes, which remains difficult for all current methods. Per-CDR results (Appendix [A.5](https://arxiv.org/html/2605.21610#A1.SS5)) show consistent gains across CDR-H1, H2, and H3, and antigen conditioning analysis (Appendix [A.6](https://arxiv.org/html/2605.21610#A1.SS6)) confirms that AgForce produces substantially more diverse, antigen-specific sequences than GNN baselines (95.5% unique sequences versus 20.9–76.4%, with the highest interface enrichment correlation of $r=0.78$).

#### Discussion\.

Two observations from Table [1](https://arxiv.org/html/2605.21610#S5.T1) deserve attention. First, RefineGNN achieves the second-best binding quality despite receiving no antigen input, confirming the diagnosis of Section [3.3](https://arxiv.org/html/2605.21610#S3.SS3.SSS0.Px1): CDR backbone geometry alone is highly informative for interface contacts, and most existing conditioning mechanisms add little. AgForce surpasses RefineGNN precisely because its losses (shadow paratope, antigen classification) directly optimize interface geometry and antigen specificity rather than relying on message passing to implicitly propagate antigen information. Second, the three GNN baselines converge to nearly identical AAR despite substantial architectural differences, consistent with the cross-entropy ceiling of Proposition [1](https://arxiv.org/html/2605.21610#Thmproposition1). That AgForce breaks this ceiling while using the same E(3)-equivariant paradigm isolates the contribution of the MDN-Potts head and aMCL training from the encoder architecture.

### 5.3 Ablation Study

Table [2](https://arxiv.org/html/2605.21610#S5.T2) ablates the key components by removing one at a time from the full configuration. Hyperbolic attention contributes the most to binding quality (fnat −3.7 pp, DockQ −1.7 pp), and framework dropout has a comparable binding effect (fnat −4.1 pp). The antigen classification loss primarily improves interface metrics rather than sequence recovery, acting as a regularizer that trades calibration for binding quality. ESM-2 removal produces the largest AAR drop (−2.5 pp), confirming that evolutionary context from the frozen PLM is essential. Detailed per-component discussion is in Appendix [A.7](https://arxiv.org/html/2605.21610#A1.SS7).

Table 2: Ablation study on CDR-H3 (epitope-group split, 292 test complexes). Each row removes one or more components from the full configuration.

## 6 Conclusion

We showed that equivariant GNN methods for antibody CDR design share three causally linked failure modes: the cross-entropy ceiling forces convergence to the positional marginal, which in turn causes antigen blindness and vocabulary collapse. AgForce addresses each failure with a targeted intervention and achieves state-of-the-art performance across sequence, structure, and binding metrics simultaneously on Chimera-Bench. Our results suggest that the choice of loss function and decoder head matters more than encoder architecture for conditional sequence design.

#### Limitations and future work\.

Contact AAR remains comparable to baselines, indicating that predicting the exact amino acid at binding positions is a fundamentally harder problem that may require explicit modeling of side\-chain rotamers or physics\-based energy terms\. The effective vocabulary, while nearly doubled relative to GNN baselines, still falls short of native diversity\.

## References

- A. R. Abir, H. S. Shahgir, M. R. Z. Ratul, M. T. Tahmid, G. V. Steeg, and Y. Dong (2025) AbFlowNet: optimizing antibody-antigen binding energy via diffusion-gflownet fusion. arXiv preprint arXiv:2505.12358.
- M. Ahmed, N. Taj, I. U. Khan, H. Venkateswara, and M. Patterson (2026) CHIMERA-bench: a benchmark dataset for epitope-specific antibody design. In ICLR 2026 Workshop on Generative and Experimental Perspectives for Biomolecular Design.
- U. Alon and E. Yahav (2021) On the bottleneck of graph neural networks and its practical implications. In International Conference on Learning Representations. External Links: [Link](https://openreview.net/forum?id=i80OPhOCVH2).
- S. Basu and B. Wallner (2016) DockQ: a quality measure for protein-protein docking models. PloS one 11(8), pp. e0161879.
- F. Birnbaum and A. E. Keating (2026) Beyond native sequence recovery: improved modeling of the sequence-energy landscape of protein structures. bioRxiv, pp. 2026–01.
- C. M. Bishop (1994) Mixture density networks. Technical Report NCRG/94/004, Aston University, Birmingham, UK.
- J. Chen, X. Cai, J. Wu, and W. Hu (2025) Antibody design and optimization with multi-scale equivariant graph diffusion models for accurate complex antigen binding. arXiv preprint arXiv:2506.20957.
- L. Chinery, A. M. Hummer, B. B. Mehta, R. Akbar, P. Rawat, A. Slabodkin, K. Le Quy, F. Lund-Johansen, V. Greiff, J. R. Jeliazkov, and C. M. Deane (2024) Simple computational methods can outperform deep learning in designing diverse, binder-enriched antibody libraries. bioRxiv. External Links: [Document](https://dx.doi.org/10.1101/2024.03.26.586756).
- C. Chothia and A. M. Lesk (1987) Canonical structures for the hypervariable regions of immunoglobulins. Journal of molecular biology 196(4), pp. 901–917.
- M. Elfeki, C. Couprie, M. Riviere, and M. Elhoseiny (2019) GDPP: learning diverse generations using determinantal point processes. In International Conference on Machine Learning, Vol. 97, pp. 1819–1828.
- J. Ho and T. Salimans (2022) Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598.
- A. Hummer, B. Abanades, and C. Deane (2022) Advances in computational structure-based antibody design. Current Opinion in Structural Biology 74, pp. 102379.
- W. Jin, R. Barzilay, and T. Jaakkola (2022a) Antibody-antigen docking and design via hierarchical structure refinement. In International Conference on Machine Learning, pp. 10217–10227.
- W. Jin, J. Wohlwend, R. Barzilay, and T. Jaakkola (2022b) Iterative refinement graph neural network for antibody sequence-structure co-design. In International Conference on Learning Representations.
- N. Kim, M. Kim, and J. Park (2024) Anfinsen goes neural: a graphical model for conditional antibody design. arXiv preprint arXiv:2402.05982.
- X. Kong, W. Huang, and Y. Liu (2023a) Conditional antibody design as 3D equivariant graph translation. In International Conference on Learning Representations.
- X. Kong, W. Huang, and Y. Liu (2023b) End-to-end full-atom antibody design. In International Conference on Machine Learning, pp. 17409–17429.
- A. J. Li, V. Sundar, G. Grigoryan, and A. E. Keating (2023) Neural network-derived Potts models for structure-based protein design using backbone atomic coordinates and tertiary motifs. Protein Science 32(2), pp. e4554. External Links: [Document](https://dx.doi.org/10.1002/pro.4554).
- Y. Li, Y. Lang, C. Xu, Y. Zhou, Z. Pang, and P. J. Greisen (2025) Benchmarking inverse folding models for antibody CDR sequence design. PLOS ONE 20(6), pp. e0324566. External Links: [Document](https://dx.doi.org/10.1371/journal.pone.0324566).
- S. Luo, Y. Su, X. Peng, S. Wang, J. Peng, and J. Ma (2022) Antigen-specific antibody design and optimization with diffusion-based generative models for protein structures. Advances in Neural Information Processing Systems 35, pp. 9754–9767.
- D. S. Marks, L. J. Colwell, R. Sheridan, T. A. Hopf, A. Pagnani, R. Zecchina, and C. Sander (2011) Protein 3D structure computed from evolutionary sequence variation. PLoS ONE 6(12), pp. e28766. External Links: [Document](https://dx.doi.org/10.1371/journal.pone.0028766).
- M. Nickel and D. Kiela (2017) Poincaré embeddings for learning hierarchical representations. In Advances in Neural Information Processing Systems, Vol. 30.
- D. Perera, V. Letzelter, T. Mariotte, A. Cortés, M. Chen, S. Essid, and G. Richard (2024) Annealed multiple choice learning: overcoming limitations of winner-takes-all with annealing. In Advances in Neural Information Processing Systems, Vol. 37.
- L. Potocnakova, M. Bhide, and L. B. Pulzova (2016) An introduction to b-cell epitope mapping and in silico epitope prediction. Journal of immunology research 2016.
- V. G. Satorras, E. Hoogeboom, and M. Welling (2021) E (n) equivariant graph neural networks. In International conference on machine learning, pp. 9323–9332.
- F. Sestak, L. Schneckenreiter, J. Brandstetter, S. Hochreiter, A. Mayr, and G. Klambauer (2026) VN-EGNN: E(3)-equivariant graph neural networks with virtual nodes enhance protein binding site identification. Journal of Cheminformatics 18, pp. 11. External Links: [Document](https://dx.doi.org/10.1186/s13321-025-01127-9).
- C. Tan, Y. Zhang, Z. Gao, Y. Huang, H. Lin, L. Wu, F. Wu, M. Blanchette, and S. Z. Li (2025) DyAb: flow matching for flexible antibody design with alphafold-driven pre-binding antigen. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 39, pp. 782–790.
- T. Uçar and P. Sormanni (2025) BLOSUM is all you learn: generative antibody models reflect evolutionary priors. bioRxiv, pp. 2025–10.
- Y. Verma, M. Heinonen, and V. Garg (2023) Abode: ab initio antibody design using conjoined odes. In International Conference on Machine Learning, pp. 35037–35050.
- Z. Wang, Y. Ji, J. Tian, and S. Zheng (2024) Retrieval augmented diffusion model for structure-informed antibody design and optimization. arXiv preprint arXiv:2410.15040.
- J. Wu, X. Kong, N. Sun, J. Wei, S. Shan, F. Feng, F. Wu, J. Peng, L. Zhang, Y. Liu, and J. Ma (2025a) FlowDesign: improved design of antibody cdrs through flow matching and better prior distributions. Cell Systems. External Links: [Document](https://dx.doi.org/10.1016/j.cels.2025.101270).
- L. Wu, H. Lin, Y. Huang, Z. Gao, C. Tan, Y. Liu, T. Wu, and S. Z. Li (2025b) Relation-aware equivariant graph networks for epitope-unknown antibody design and specificity optimization. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 39, pp. 895–904.

## Appendix A Appendix

### A.1 Proofs

###### Proof of Proposition [1](https://arxiv.org/html/2605.21610#Thmproposition1).

The expected per-position cross-entropy decomposes as $\mathbb{E}[-\log p_{\theta}(s_{i})] = H(\bar{p}_{i}) + D_{\mathrm{KL}}(\bar{p}_{i}\,\|\,p_{\theta}(\cdot\mid i))$, where $H$ is the entropy and $D_{\mathrm{KL}}$ is the Kullback-Leibler divergence. By non-negativity of KL divergence, the minimum is achieved when $p_{\theta}(\cdot\mid i)=\bar{p}_{i}$ for all positions $i$. Since $\bar{p}_{i}$ is computed over all training antigens, the optimal predictor is independent of the antigen input. ∎
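The decomposition used in this proof can be checked numerically on a toy distribution. The sketch below uses a hypothetical three-letter alphabet; `p_bar` stands in for the positional marginal and `q` for an arbitrary model prediction.

```python
import math

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def cross_entropy(p, q):
    # Expected negative log-likelihood of q under samples drawn from p.
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

p_bar = [0.5, 0.3, 0.2]   # hypothetical positional marginal
q = [0.2, 0.5, 0.3]       # hypothetical model prediction

lhs = cross_entropy(p_bar, q)
rhs = entropy(p_bar) + kl(p_bar, q)
floor = cross_entropy(p_bar, p_bar)   # equals H(p_bar), attained at q = p_bar
```

Here `lhs` and `rhs` agree, and `floor` is strictly smaller than `lhs`, which is the sense in which the marginal is the unique per-position optimum.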

###### Proof of Proposition [4](https://arxiv.org/html/2605.21610#Thmproposition4).

At temperature $\tau\to 0$, the Boltzmann weights $w_{k}$ become a hard assignment: each training example $(A_{n},\mathbf{s}_{n})$ is assigned to its best-performing component $k^{*}(n)=\arg\min_{k}\ell_{k}(n)$. Let $\mathcal{S}_{k}=\{n:k^{*}(n)=k\}$ be the subset assigned to component $k$. The loss for component $k$ then reduces to a cross-entropy over $\mathcal{S}_{k}$ only, whose optimum is $p_{k}^{*}(a\mid i)=|\{n\in\mathcal{S}_{k}:s_{n,i}=a\}|/|\mathcal{S}_{k}|$. When different antigens induce different amino acid preferences, the partitioning groups antigen-compatible examples together, allowing each component to specialize. The global marginal is recovered only in the degenerate case where $K=1$ or all subsets have identical amino acid distributions. For $K>1$ with heterogeneous training data, the component-conditional marginals generically differ from the global marginal, breaking the ceiling of Proposition [1](https://arxiv.org/html/2605.21610#Thmproposition1). ∎
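A toy example makes the last step of the argument concrete. The partition below is fixed by hand rather than derived from an arg-min over component losses, and the residues are hypothetical; the point is only that subset marginals differ from the pooled marginal when the subsets are heterogeneous.

```python
from collections import Counter

def marginal(residues):
    # Empirical amino acid distribution at a single CDR position.
    counts = Counter(residues)
    n = len(residues)
    return {aa: c / n for aa, c in counts.items()}

# Hypothetical residues observed at one position, split into the subsets
# S_1 and S_2 assigned to two mixture components.
subset_1 = ["Y", "Y", "Y", "G"]
subset_2 = ["D", "D", "S", "D"]

global_marginal = marginal(subset_1 + subset_2)
p1 = marginal(subset_1)
p2 = marginal(subset_2)
```

The pooled marginal puts 0.375 on tyrosine, while the first component's optimum puts 0.75 on it and the second puts none, so neither component coincides with the global marginal.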

###### Proof of Proposition [3](https://arxiv.org/html/2605.21610#Thmproposition3).

We decompose the pipeline into three stages and verify each preserves the required symmetry.

Stage 1: VN-EGNN. Let $g=(R,\mathbf{t})$ act on all input coordinates as $\mathbf{x}_{i}\mapsto R\mathbf{x}_{i}+\mathbf{t}$. Coordinate differences transform as $\Delta\mathbf{x}_{ij}\mapsto R\Delta\mathbf{x}_{ij}$ (translation cancels). The Gram matrix radial feature has entries $[\Delta\mathbf{x}_{i}\Delta\mathbf{x}_{j}^{\top}]_{ab}=\langle\Delta\mathbf{x}_{i}^{(a)},\Delta\mathbf{x}_{j}^{(b)}\rangle$. Under rotation, $\langle R\mathbf{u},R\mathbf{v}\rangle=\mathbf{u}^{\top}R^{\top}R\mathbf{v}=\langle\mathbf{u},\mathbf{v}\rangle$, so the Gram matrix is E(3)-invariant. All other edge features (distances, quaternion orientations, RBFs) depend only on pairwise distances and local frames, which are E(3)-invariant. The message $\mathbf{m}_{ij}=\text{MLP}([\mathbf{h}_{i},\mathbf{h}_{j},\text{invariant features}])$ is therefore invariant, and the node update $\mathbf{h}_{i}'$ is invariant. The coordinate update $\mathbf{x}_{i}'=\mathbf{x}_{i}+\sum_{j}\Delta\mathbf{x}_{ij}\odot f(\mathbf{m}_{ij})$ is equivariant: under $g$, $R\mathbf{x}_{i}+\mathbf{t}+\sum_{j}R\Delta\mathbf{x}_{ij}\odot f(\mathbf{m}_{ij})=R\mathbf{x}_{i}'+\mathbf{t}$.

Stage 2: Hyperbolic cross-attention. The module receives GNN node features $\mathbf{h}$ (invariant from Stage 1). All operations (linear projections, hyperboloid mapping, Minkowski inner product, softmax, value aggregation) act on these invariant features without accessing coordinates. The output is therefore E(3)-invariant.

Stage 3: MDN-Potts head. The sequence head receives the concatenation of invariant features from Stages 1–2. All operations (layer norm, linear projections, softmax, belief propagation) preserve invariance. The predicted sequence probabilities $p(\mathbf{s}\mid\mathbf{X},\mathbf{S})$ are therefore E(3)-invariant, establishing (ii). The coordinate predictions $\hat{\mathbf{Z}}$ come from the GNN's equivariant coordinate stream, establishing (i). ∎
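The Stage 1 argument can be checked numerically with a simplified coordinate update. In this sketch each edge contributes a scalar weight computed from the pairwise distance, standing in for $f(\mathbf{m}_{ij})$; that simplification, the Gaussian form of the weight, and the single coordinate channel per node are assumptions made for illustration only.

```python
import numpy as np

def random_rotation(rng):
    # Orthogonal matrix from a QR decomposition, forced to determinant +1.
    q, r = np.linalg.qr(rng.normal(size=(3, 3)))
    q = q * np.sign(np.diag(r))
    if np.linalg.det(q) < 0:
        q[:, 0] = -q[:, 0]
    return q

def invariant_weights(x):
    # Scalar edge weights that depend only on pairwise distances.
    d2 = ((x[:, None, :] - x[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2)

def coord_update(x):
    # x_i' = x_i + sum_j (x_i - x_j) * w_ij with invariant scalars w_ij.
    diff = x[:, None, :] - x[None, :, :]
    return x + (diff * invariant_weights(x)[:, :, None]).sum(axis=1)

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 3))          # hypothetical node coordinates
R, t = random_rotation(rng), rng.normal(size=3)
x_moved = x @ R.T + t                # apply g = (R, t) to every node

weights_match = np.allclose(invariant_weights(x), invariant_weights(x_moved))
update_commutes = np.allclose(coord_update(x_moved), coord_update(x) @ R.T + t)
```

Both checks hold: the edge weights are unchanged by the rigid motion, and updating after the motion gives the same result as moving the updated coordinates.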

### A.2 Graph Construction Details

Table 3: Edge types in the heterogeneous antibody-antigen graph.

Each residue is represented by a 108-dimensional feature vector: sequential position embeddings (16D via sinusoidal encoding), distance radial basis functions for three backbone bond lengths N–C$\alpha$, C$\alpha$–C, and C–O (each expanded into 16 Gaussian basis functions, totaling 48D), backbone dihedral and bond angles as sine-cosine pairs (12D), local coordinate frame directional features constructed from backbone unit vectors rotated into a local reference frame (9D), amino acid identity (20D, masked to zero for CDR positions during training), and interface complementarity features (4D). These 105 features are concatenated with a 3D segment type embedding that distinguishes heavy chain, light chain, and antigen residues. A dual-path encoder processes the dense features (101D) and sparse complementarity features (4D) through separate two-layer MLPs with SiLU activations, fuses them via concatenation, and projects the result to the embedding dimension $d=128$. Epitope residues receive an additional learnable embedding (zero-initialized) added to the projected features.

Each edge carries a 104-dimensional feature vector composed of a type-specific one-hot encoding (8D for the 8 standard types), relative position embeddings (16D), pairwise distance radial basis functions for four atom pairs N–C$\alpha$, C$\alpha$–C$\alpha$, C–C$\alpha$, and O–C$\alpha$ (each expanded into 16 Gaussian basis functions, totaling 64D), quaternion-encoded relative orientation (4D), and local frame directional features for the four atom pairs (12D). Positional features are zeroed for inter-chain and global edges (types 1, 2, 6, 7) since relative sequence position is undefined across chains. Virtual node edges (types 8–9) use learnable feature vectors initialized from a small random distribution.

### A.3 Algorithms

Algorithm 1 Training

Require: Dataset $\mathcal{D}$, model $f_{\theta}$, epochs $T$, aMCL schedule $\tau(t)$

1. for $t=1$ to $T$ do
2. $\tau\leftarrow\tau_{\text{start}}\cdot(\tau_{\text{end}}/\tau_{\text{start}})^{t/T_{\text{anneal}}}$
3. for each batch $\mathcal{B}\subset\mathcal{D}$ do
4. Apply framework dropout: mask heavy-chain embeddings with probability $p$
5. Build heterogeneous graph $\mathcal{G}$ with 10 edge types
6. $\{\mathbf{h}_{i},\hat{\mathbf{x}}_{i}\}\leftarrow\text{VN-EGNN}(\mathcal{G})$ {5 layers}
7. $\mathbf{z}_{i}\leftarrow\text{HypAttn}(\mathbf{h}_{\text{CDR}},\mathbf{h}_{\text{epi}})$ {Lorentz hyperboloid}
8. $\mathbf{f}_{i}\leftarrow[\mathbf{h}_{i};\mathbf{z}_{i};\text{ESM}_{\text{proj}}(\mathbf{e}_{i})]$ {768-dim}
9. $\{\boldsymbol{\ell}_{k},\boldsymbol{\pi}\}\leftarrow\text{MDN-Potts}(\mathbf{f})$ {$K=4$, 2-round BP}
10. Compute $w_{k}=\text{softmax}(-\ell_{k}/\tau)$ {aMCL weights}
11. $\mathcal{L}\leftarrow\mathcal{L}_{\text{seq}}+\alpha\mathcal{L}_{\text{coord}}+\delta\mathcal{L}_{\text{shadow}}+\epsilon\mathcal{L}_{\text{GDPP}}+\lambda_{\text{cls}}\mathcal{L}_{\text{cls}}$
12. Update $\theta$ via AdamW with gradient clipping (max norm 0.5)
13. end for
14. Decay learning rate: $\eta\leftarrow\eta\cdot\gamma$
15. Evaluate on validation set; early stop if no improvement for 10 epochs
16. end for
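The outer loop of Algorithm 1 (steps 14 and 15) can be sketched by replaying a sequence of validation losses through the learning-rate decay and early-stopping logic. The function name and the replay setup are hypothetical; the default values are those reported in Table 4, and the per-batch training steps are omitted.

```python
import math

def run_schedule(val_losses, max_epochs=50, patience=10, lr0=2.2e-4, gamma=0.955):
    """Replay validation losses through the epoch loop of Algorithm 1.

    Returns (last_epoch_run, best_val_loss, final_learning_rate).
    """
    best, bad_epochs, lr, epoch = math.inf, 0, lr0, 0
    for epoch, val in enumerate(val_losses[:max_epochs], start=1):
        # ... steps 3-13 (training over batches) would run here ...
        lr *= gamma                      # step 14: exponential LR decay
        if val < best:                   # step 15: track the best validation loss
            best, bad_epochs = val, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:   # no improvement for `patience` epochs
                break
    return epoch, best, lr

# Hypothetical validation curve: improves twice, then plateaus.
history = [1.0, 0.9] + [0.95] * 20
stopped_at, best_val, final_lr = run_schedule(history)
```

With this curve the best loss occurs at epoch 2 and training stops at epoch 12, after ten epochs without improvement.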

Algorithm 2 Inference

Require: Trained model $f_{\theta}$, test complex with antigen + antibody framework

1. Build graph $\mathcal{G}$ (CDR AA masked, coordinates from framework)
2. $\{\mathbf{h}_{i},\hat{\mathbf{x}}_{i}\}\leftarrow\text{VN-EGNN}(\mathcal{G})$
3. $\mathbf{z}_{i}\leftarrow\text{HypAttn}(\mathbf{h}_{\text{CDR}},\mathbf{h}_{\text{epi}})$
4. $\mathbf{f}_{i}\leftarrow[\mathbf{h}_{i};\mathbf{z}_{i};\text{ESM}_{\text{proj}}(\mathbf{e}_{i})]$
5. $\{\boldsymbol{\ell}_{k},\boldsymbol{\pi}\}\leftarrow\text{MDN-Potts}(\mathbf{f})$
6. for each CDR position $i$ do
7. $k^{*}\leftarrow\arg\max_{k}\pi_{k}(i)$ {Best component}
8. $\hat{s}_{i}\leftarrow\arg\max_{a}\ell_{k^{*}}^{(i)}(a)$ {Greedy decode}
9. end for
10. return Predicted sequence $\hat{\mathbf{s}}$, predicted coordinates $\hat{\mathbf{X}}_{\text{CDR}}$
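Steps 6 to 9 of Algorithm 2 amount to two arg-max operations per position. The NumPy sketch below assumes a tensor layout of `(K, L, 20)` for the component logits and `(L, K)` for the mixture weights, and an alphabetical ordering of the 20 amino acids; neither is specified in the text.

```python
import numpy as np

AA = "ACDEFGHIKLMNPQRSTVWY"   # assumed index-to-residue mapping

def greedy_decode(logits, pi):
    """logits: (K, L, 20) per-component logits; pi: (L, K) mixture weights."""
    k_star = pi.argmax(axis=1)             # best component per position
    pos = np.arange(logits.shape[1])
    chosen = logits[k_star, pos]           # (L, 20) logits of the chosen components
    return "".join(AA[a] for a in chosen.argmax(axis=1))

# Hypothetical K = 2 components over a 3-residue loop.
K, L = 2, 3
logits = np.zeros((K, L, 20))
logits[0, 0, AA.index("Y")] = 5.0
logits[1, 0, AA.index("G")] = 5.0
logits[0, 1, AA.index("D")] = 5.0
logits[1, 1, AA.index("S")] = 5.0
logits[0, 2, AA.index("W")] = 5.0
logits[1, 2, AA.index("A")] = 5.0
pi = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]])

seq = greedy_decode(logits, pi)
```

Because the component is chosen independently at each position, the decoded sequence can mix residues from different components, as it does here at the middle position.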

### A.4 Implementation Details

AgForce uses a 5-layer VN-EGNN (dim 128/256, 3 virtual nodes), 4-head hyperbolic cross-attention ($c=1.0$), an MDN-Potts head ($K=4$, 2-round BP), and frozen ESM-2 (650M) embeddings. aMCL anneals $\tau$ from 2.0 to 0.1 over 20 epochs. The model is trained for at most 50 epochs with an early-stopping patience of 10 on a single H100; full training takes around 2 hours. Full hyperparameters are in Table [4](https://arxiv.org/html/2605.21610#A1.T4).

Table 4:Hyperparameters forAgForce\.ComponentParameterValueSourceVN\-EGNNLayers5SweepEmbedding dim128SweepHidden dim256SweepVirtual nodes3SweepEdge types10 \(8\+2 VN\)ArchitectureHyperbolic AttnHeads4StandardCurvaturecc1\.0FixedKey/ValueEpitope GNN hiddenStandardMDN\-PottsComponentsKK4SweepMessage passes2ArchitectureCoupling modeAdjacentArchitectureλpair\\lambda\_\{\\text\{pair\}\}0\.3SweepShared hidden dim384ArchitectureaMCLτstart\\tau\_\{\\text\{start\}\}2\.0Sweepτend\\tau\_\{\\text\{end\}\}0\.1SweepAnneal epochs20SweepScheduleExponentialFixedTrainingLR2\.2×10−42\.2\\times 10^\{\-4\}SweepLR decay \(γ\\gamma\)0\.955/epochSweepBatch size8MemoryMax epochs50ConvergenceEarly stopping10 epochsStandardLoss weightsα\\alpha\(coord\)1\.301Sweepδ\\delta\(shadow\)0\.664Sweepϵ\\epsilon\(GDPP\)0\.05Sweepλcls\\lambda\_\{\\text\{cls\}\}\(antigen cls\)0\.2SweepFW dropoutpp0\.3Sweep
### A.5 Per-CDR Results

Table 5: Per-CDR comparison on Chimera-Bench (epitope-group split, means only). AAR and C$\alpha$ RMSD (Å) for each heavy-chain CDR. Best in bold, second-best underlined.

AgForce matches or exceeds all baselines on the shorter, more conserved CDR-H1 and CDR-H2 loops. On CDR-H1, AgForce achieves an AAR of 0.72 (matching RAAD) with the best RMSD of 0.50 Å. On CDR-H2, AgForce achieves an AAR of 0.63 (matching RAAD) with an RMSD of 0.51 Å. These loops are structurally more constrained than H3, and the improvements are accordingly smaller. The consistent gains across all three heavy-chain CDRs indicate that the MDN-Potts head and antigen forcing strategy generalize beyond the highly variable H3 loop.
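For readers unfamiliar with the two metrics, the sketch below shows how AAR and C$\alpha$ RMSD are commonly computed for a single CDR. It assumes equal-length predicted and native loops and computes RMSD directly in the given coordinate frame, without superposition; whether the benchmark aligns structures first is not stated in this section, and the function names are illustrative.

```python
import math
from typing import Sequence, Tuple

Coord = Tuple[float, float, float]


def amino_acid_recovery(pred: str, native: str) -> float:
    """Fraction of positions where the predicted residue equals the native one."""
    if not native or len(pred) != len(native):
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(p == n for p, n in zip(pred, native)) / len(native)


def ca_rmsd(pred: Sequence[Coord], native: Sequence[Coord]) -> float:
    """Root-mean-square deviation over C-alpha atoms, with no superposition."""
    if not native or len(pred) != len(native):
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = [sum((a - b) ** 2 for a, b in zip(p, n)) for p, n in zip(pred, native)]
    return math.sqrt(sum(squared) / len(squared))
```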

### A.6 Antigen Conditioning Analysis

The failure mode analysis in Section [3.3](https://arxiv.org/html/2605.21610#S3.SS3) established that GNN baselines are largely antigen-blind. We now examine whether AgForce improves on this.

Table 6: Antigen conditioning analysis on CDR-H3 (epitope-group split). Unique% = fraction of unique predicted sequences. Ent. Ratio = per-position entropy relative to ground truth. GY Ratio = glycine/tyrosine frequency relative to native. EV = effective vocabulary. Enrich $r$ = Spearman correlation of interface amino acid enrichment with ground truth pattern.

#### Sequence diversity.

AgForce produces 95.5% unique predicted sequences across the 292 test complexes, compared to 76.4% for RAAD, 66.8% for MEAN, and 20.9% for dyMEAN. The per-position entropy ratio increases from 0.40–0.48 for GNN baselines to 0.70 for AgForce, confirming that predictions are more diverse and closer to native variability. However, the entropy ratio falls short of 1.0, and the model still overrepresents glycine and tyrosine by a factor of 1.9 relative to ground truth. DiffAb and RefineGNN achieve entropy ratios above 0.88 with near-perfect uniqueness but at the cost of much lower sequence recovery (AAR 0.21–0.23).
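The two diversity statistics discussed above can be illustrated with a small sketch. It assumes that Unique% is the number of distinct predicted sequences divided by the number of predictions, and that the entropy ratio is the mean per-position Shannon entropy of the predictions divided by that of the native sequences, with positions indexed from the start of the loop. The paper's exact handling of variable loop lengths is not given in this section, so the position-wise pooling here is an assumption.

```python
import math
from collections import Counter
from typing import List


def unique_fraction(seqs: List[str]) -> float:
    """Fraction of distinct sequences among all predictions."""
    return len(set(seqs)) / len(seqs)


def positional_entropy(seqs: List[str]) -> float:
    """Mean Shannon entropy (nats) over positions.

    Sequences shorter than a given position simply do not contribute to it.
    """
    max_len = max(len(s) for s in seqs)
    entropies = []
    for i in range(max_len):
        counts = Counter(s[i] for s in seqs if i < len(s))
        total = sum(counts.values())
        entropies.append(-sum((c / total) * math.log(c / total) for c in counts.values()))
    return sum(entropies) / len(entropies)


def entropy_ratio(pred: List[str], native: List[str]) -> float:
    """Predicted per-position entropy relative to the native entropy."""
    return positional_entropy(pred) / positional_entropy(native)
```

A ratio near 1.0 means the predictions vary across complexes about as much as native loops do; collapsed predictions push the ratio toward 0.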

#### Vocabulary expansion\.

The effective vocabulary of AgForce is 9.4, compared to 3.0–5.5 for GNN baselines and 11.7–14.9 for sampling-based methods. The model produces 166 unique bigrams and 517 unique trigrams, improving over GNN baselines (12–52 bigrams, 21–110 trigrams) though still below native levels (364 bigrams, 1818 trigrams). Importantly, AgForce covers all 20 of the top ground truth trigram motifs, whereas GNN baselines cover only 7–19 out of 20. This indicates that the MDN-Potts head with GDPP regularization successfully prevents the most extreme forms of vocabulary collapse. A full heatmap visualization of the vocabulary collapse for the baseline methods is provided in Figure 4.

![Refer to caption](https://arxiv.org/html/2605.21610v1/figures/h3_positional_grid.png)

Figure 4: Illustration of the vocabulary collapse failure mode of the baselines vs. AgForce.
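The vocabulary statistics above can be made concrete with a short sketch. Effective vocabulary is taken here to be the exponential of the Shannon entropy of the residue frequencies (the perplexity of the distribution), which is a common definition but an assumption about how the paper computes EV. The n-gram count is the number of distinct contiguous n-mers across all sequences.

```python
import math
from collections import Counter
from typing import List, Set


def effective_vocabulary(residues: str) -> float:
    """exp(Shannon entropy) of residue frequencies (assumed definition of EV).

    Equals 1.0 when a single residue is always predicted and 20.0 when all
    20 amino acids are used uniformly.
    """
    counts = Counter(residues)
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return math.exp(entropy)


def unique_ngrams(seqs: List[str], n: int) -> Set[str]:
    """Distinct contiguous n-mers across all sequences."""
    return {s[i:i + n] for s in seqs for i in range(len(s) - n + 1)}
```

For example, `len(unique_ngrams(predictions, 2))` and `len(unique_ngrams(predictions, 3))` would give bigram and trigram counts comparable in spirit to the 166 and 517 reported for AgForce, where `predictions` is a hypothetical list of predicted CDR-H3 strings.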
#### Interface enrichment patterns\.

AgForce achieves the highest correlation between its interface amino acid enrichment pattern and the ground truth ($r=0.78$). RAAD is second at $r=0.68$, while DiffAb shows an anti-correlated pattern ($r=-0.29$). High enrichment correlation indicates that the model has learned which amino acids are appropriate at interface positions at the distributional level: it gets the right “type” of amino acid more often (polar residues at hydrogen-bonding contacts, aromatic residues at hydrophobic contacts), even though the absolute contact AAR remains similar to baselines.
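The enrichment correlation is a Spearman rank correlation between two per-amino-acid enrichment vectors, one from predictions and one from ground truth. The sketch below implements Spearman correlation in plain Python with average ranks for ties; how the enrichment values themselves are derived is not specified in this section, so they are treated as given inputs.

```python
from typing import List


def average_ranks(values: List[float]) -> List[float]:
    """1-based ranks, with tied values sharing the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for j in range(start, end + 1):
            ranks[order[j]] = mean_rank
        start = end + 1
    return ranks


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation; raises if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (vx * vy) ** 0.5


def spearman(x: List[float], y: List[float]) -> float:
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))
```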

### A.7 Detailed Ablation Analysis

#### Hyperbolic attention\.

Replacing hyperbolic cross\-attention with standard Euclidean multihead attention reduces fnat by 3\.7 percentage points \(0\.671 to 0\.634\) and DockQ by 1\.7 points \(0\.740 to 0\.723\)\. AAR also decreases \(0\.395 to 0\.388\)\. The Lorentz manifold distance metric provides a better inductive bias for the hierarchical structure of antibody\-antigen binding, where global epitope geometry constrains local amino acid choices at the interface\.
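For reference, the geodesic distance on the Lorentz (hyperboloid) model with curvature $-c$ is $d_c(\mathbf{x},\mathbf{y})=\frac{1}{\sqrt{c}}\,\mathrm{arcosh}(-c\,\langle\mathbf{x},\mathbf{y}\rangle_{\mathcal{L}})$, where $\langle\mathbf{x},\mathbf{y}\rangle_{\mathcal{L}}=-x_0y_0+\sum_{j\geq 1}x_jy_j$. The sketch below computes this distance in plain Python. How AgForce turns distances into attention weights is not described in this section, so only the distance is shown; `lift_to_hyperboloid` is a hypothetical helper that places a Euclidean vector on the manifold by solving for the time coordinate.

```python
import math
from typing import List, Sequence


def lift_to_hyperboloid(v: Sequence[float], c: float = 1.0) -> List[float]:
    """Return x = (x0, v) with <x, x>_L = -1/c, i.e. a point on the hyperboloid."""
    x0 = math.sqrt(1.0 / c + sum(t * t for t in v))
    return [x0, *v]


def lorentz_inner(x: Sequence[float], y: Sequence[float]) -> float:
    """Lorentzian inner product: -x0*y0 + sum of the remaining products."""
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))


def lorentz_distance(x: Sequence[float], y: Sequence[float], c: float = 1.0) -> float:
    """Geodesic distance on the hyperboloid of curvature -c."""
    # Clamp guards against arguments slightly below 1 caused by rounding.
    return math.acosh(max(1.0, -c * lorentz_inner(x, y))) / math.sqrt(c)
```

With $c=1.0$, as used in the paper, a point lifted from the origin and a point lifted from $(\sinh 1, 0)$ are exactly one unit apart.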

#### Antigen classification loss\.

Removing the InfoNCE classification loss has a modest effect on AAR \(0\.395 to 0\.393\) but reduces fnat from 0\.671 to 0\.639 and DockQ from 0\.740 to 0\.726\. Interestingly, perplexity improves \(2\.951 to 2\.535\), suggesting that the classification loss acts as a regularizer that trades off calibration for binding quality\. The loss forces the decoder to produce antigen\-specific probability distributions, which improves interface predictions even when overall sequence confidence decreases\.
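A generic InfoNCE objective can be sketched as follows. This is only an illustration of the loss family named above, not the paper's antigen classification head: the similarity matrix layout, the convention that entry `sim[i][i]` is the matching prediction and antigen pair, and the temperature value are assumptions.

```python
import math
from typing import List


def info_nce(sim: List[List[float]], temperature: float = 0.1) -> float:
    """Mean InfoNCE loss over a batch.

    sim[i][j]: similarity between the representation of prediction i and
    antigen j; the diagonal entry is treated as the positive pair.
    """
    losses = []
    for i, row in enumerate(sim):
        scaled = [s / temperature for s in row]
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(scaled)
        log_norm = m + math.log(sum(math.exp(s - m) for s in scaled))
        losses.append(log_norm - scaled[i])
    return sum(losses) / len(losses)
```

When predictions carry no antigen information, all similarities are equal and the loss is the log of the batch size; it approaches zero as each prediction becomes most similar to its own antigen.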

#### GDPP regularization\.

Further removing GDPP from the configuration without antigen classification \(row 4 versus row 3\) produces an additional small drop in AAR \(0\.393 to 0\.390\) and fnat \(0\.639 to 0\.635\)\. The diversity regularization contributes marginally when other components are present, but its effect is additive with the antigen classification loss\.

#### Framework dropout\.

Removing the 30% heavy chain framework dropout reduces fnat from 0\.671 to 0\.630 and AAR from 0\.395 to 0\.392\. The dropout forces the model to rely more on antigen information rather than the antibody framework shortcut, and its removal degrades binding quality more than it affects sequence recovery\.
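One way to picture framework dropout is as a training-time mask over heavy-chain framework residues. The sketch below zeroes the feature vector of each framework residue independently with probability $p=0.3$ and leaves CDR and antigen residues untouched. The per-residue granularity, the zeroing of features, and the function name are assumptions; the section states only the dropout rate.

```python
import random
from typing import List, Optional


def framework_dropout(features: List[List[float]], is_framework: List[bool],
                      p: float = 0.3,
                      rng: Optional[random.Random] = None) -> List[List[float]]:
    """Zero each framework residue's features with probability p.

    features[i]: feature vector of residue i.
    is_framework[i]: True if residue i belongs to the heavy-chain framework.
    Intended for training only; at inference the features pass through unchanged.
    """
    rng = rng or random.Random()
    dropped = []
    for feat, framework in zip(features, is_framework):
        if framework and rng.random() < p:
            dropped.append([0.0] * len(feat))
        else:
            dropped.append(list(feat))
    return dropped
```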

#### ESM\-2 embeddings\.

Removing ESM-2 (with antigen classification already disabled) produces the largest single drop in AAR, from 0.393 to 0.368 (−2.5 percentage points). The frozen protein language model embeddings contribute substantially to sequence prediction by providing evolutionary context that the GNN alone does not capture. The parameter reduction (12.64M to 11.95M) reflects the removal of the ESM projection layer.

## NeurIPS Paper Checklist

1. Claims
   - Question: Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope?
   - Answer: [Yes]
   - Justification: The abstract and introduction state three failure modes supported by empirical evidence in Section [3.3](https://arxiv.org/html/2605.21610#S3.SS3).
2. Limitations
   - Question: Does the paper discuss the limitations of the work performed by the authors?
   - Answer: [Yes]
   - Justification: Section [6](https://arxiv.org/html/2605.21610#S6) discusses limitations, including the remaining gaps in contact amino acid recovery.
3. Theory assumptions and proofs
   - Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?
   - Answer: [Yes]
   - Justification: Propositions [1](https://arxiv.org/html/2605.21610#Thmproposition1)–[3](https://arxiv.org/html/2605.21610#Thmproposition3) and Corollary [2](https://arxiv.org/html/2605.21610#Thmproposition2) are stated with explicit assumptions. Full proofs are in Appendix [A.1](https://arxiv.org/html/2605.21610#A1.SS1).
4. Experimental result reproducibility
   - Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?
   - Answer: [Yes]
   - Justification: Section [5](https://arxiv.org/html/2605.21610#S5) describes the dataset, splits, and evaluation protocol. Appendix [A.2](https://arxiv.org/html/2605.21610#A1.SS2) provides full architectural details. Algorithms [1](https://arxiv.org/html/2605.21610#alg1) and [2](https://arxiv.org/html/2605.21610#alg2) specify the training and inference procedures. Table [4](https://arxiv.org/html/2605.21610#A1.T4) lists all hyperparameters.
5. Open access to data and code
   - Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?
   - Answer: [Yes]
   - Justification: The source code is provided as supplementary material during the review period for reproducibility. Code and data will be released publicly upon acceptance.
6. Experimental setting/details
   - Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer) necessary to understand the results?
   - Answer: [Yes]
   - Justification: Section [5](https://arxiv.org/html/2605.21610#S5) describes the benchmark, data splits, and evaluation metrics. Table [4](https://arxiv.org/html/2605.21610#A1.T4) provides the full hyperparameter table including optimizer, learning rate, batch size, and all loss weights. Hyperparameters were selected via Weights & Biases sweeps.
7. Experiment statistical significance
   - Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?
   - Answer: [Yes]
   - Justification: Tables [1](https://arxiv.org/html/2605.21610#S5.T1)–[2](https://arxiv.org/html/2605.21610#S5.T2) report mean $\pm$ standard deviation across test complexes for all metrics. The standard deviation captures variability across individual antibody-antigen complexes.
8. Experiments compute resources
   - Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?
   - Answer: [Yes]
   - Justification: Appendix [A.3](https://arxiv.org/html/2605.21610#A1.SS3) reports the GPU type (single NVIDIA H100 80GB) and training time. Table [4](https://arxiv.org/html/2605.21610#A1.T4) details all configurations.
9. Code of ethics
   - Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics?
   - Answer: [Yes]
   - Justification: The research uses publicly available structural data from the Protein Data Bank and SAbDab, involves no human subjects, and conforms to the NeurIPS Code of Ethics.
10. Broader impacts
    - Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?
    - Answer: [Yes]
    - Justification: Section [6](https://arxiv.org/html/2605.21610#S6) discusses the positive impact of improved computational antibody design for therapeutic development and acknowledges the potential of generative biological design methods.
11. Safeguards
    - Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pre-trained language models, image generators, or scraped datasets)?
    - Answer: [N/A]
    - Justification: The model generates antibody CDR sequences and structures that require extensive wet-lab validation before any practical use.
12. Licenses for existing assets
    - Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?
    - Answer: [Yes]
    - Justification: All baseline methods and datasets are cited with their original publications. SAbDab, the Protein Data Bank, and ESM-2 are properly credited. The CHIMERA-Bench dataset and all baseline implementations are cited.
13. New assets
    - Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?
    - Answer: [N/A]
    - Justification: The paper introduces a new model but does not release new datasets or pre-trained model assets at submission time. Code and model weights will be released upon acceptance.
14. Crowdsourcing and research with human subjects
    - Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?
    - Answer: [N/A]
    - Justification: This work does not involve crowdsourcing or research with human subjects.
15. Institutional review board (IRB) approvals or equivalent for research with human subjects
    - Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?
    - Answer: [N/A]
    - Justification: This work does not involve human subjects.
16. Declaration of LLM usage
    - Question: Does the paper describe the usage of LLMs if it is an important, original, or non-standard component of the core methods in this research? Note that if the LLM is used only for writing, editing, or formatting purposes and does *not* impact the core methodology, scientific rigor, or originality of the research, declaration is not required.
    - Answer: [N/A]
    - Justification: No LLMs are used as a core methodological contribution.
