Structural Interpretations of Protein Language Model Representations via Differentiable Graph Partitioning

arXiv cs.LG 05/13/26, 04:00 AM Papers
Summary
This paper proposes SoftBlobGIN, a framework that enhances the interpretability of protein language model representations by projecting them onto contact graphs for structure-aware message passing. It demonstrates improved performance on enzyme classification and binding-site detection while providing auditable structural explanations.
arXiv:2605.10985v1 Announce Type: new Abstract: Protein language models such as ESM-2 learn rich residue representations that achieve strong performance on protein function prediction, but their features remain difficult to interpret as structural $\&$ evolutionary signals are encoded in dense latent spaces. We propose a plug-$\&$-play framework that projects ESM-2 representations onto protein contact graphs $\&$ applies $\textbf{SoftBlobGIN}$, a lightweight Graph Isomorphism Network with differentiable Gumbel-softmax substructure pooling, to perform structure-aware message passing $\&$ learn coarse functional substructures for downstream prediction tasks. Across enzyme classification, SoftBlobGIN achieves 92.8\% accuracy $\&$ 0.898 macro-F1. Unlike post hoc analysis of protein language models alone, our method produces directly auditable structural explanations: GNNExplainer recovers biologically meaningful active-site residues, spatially localized functional clusters, $\&$ catalytic contact patterns. On binding-site detection, SoftBlobGIN improves residue AUROC from $0.885$ using an ESM-2 linear probe to $0.983$, indicating that these structural explanations are not recoverable from language-model features alone. Learned blob partitions provide an additional layer of interpretability by automatically grouping residues into functional substructures, with blobs containing annotated active-site residues showing $1.85\times$ higher importance than other blobs ($\rho{=}0.339$, $p{=}0.009$), without any active-site supervision. Our framework requires no retraining of the language model, adds only $\sim$1.1M parameters, $\&$ generalises across ProteinShake tasks, achieving $F_{\max}$ of $0.733$ on Gene Ontology prediction $\&$ AUROC of $0.969$ on binding-site detection. We position this as an interpretable structural companion to protein language models that makes their predictions more transparent $\&$ auditable.
Cached at: 05/13/26, 06:25 AM
# Structural Interpretations of Protein Language Model Representations via Differentiable Graph Partitioning
Source: [https://arxiv.org/html/2605.10985](https://arxiv.org/html/2605.10985)
Edward Tan Beng Wai, Soumick Sarker, Pasan Gunawardane, Jagath C. Rajapakse
Nanyang Technological University, Singapore
{siddhant010, soumick001, ed0001ai, c250135}@e.ntu.edu.sg, ASJagath@ntu.edu.sg

###### Abstract

Protein language models such as ESM-2 learn rich residue representations that achieve strong performance on protein function prediction, but their features remain difficult to interpret as structural & evolutionary signals are encoded in dense latent spaces. We propose a plug-&-play framework that projects ESM-2 representations onto protein contact graphs & applies SoftBlobGIN, a lightweight Graph Isomorphism Network with differentiable Gumbel-softmax substructure pooling, to perform structure-aware message passing & learn coarse functional substructures for downstream prediction tasks. Across enzyme classification, SoftBlobGIN achieves 92.8% accuracy & 0.898 macro-F1. Unlike post hoc analysis of protein language models alone, our method produces directly auditable structural explanations: GNNExplainer recovers biologically meaningful active-site residues, spatially localized functional clusters, & catalytic contact patterns. On binding-site detection, SoftBlobGIN improves residue AUROC from $0.885$ using an ESM-2 linear probe to $0.983$, indicating that these structural explanations are not recoverable from language-model features alone. Learned blob partitions provide an additional layer of interpretability by automatically grouping residues into functional substructures, with blobs containing annotated active-site residues showing $1.85\times$ higher importance than other blobs ($\rho=0.339$, $p=0.009$), without any active-site supervision. Our framework requires no retraining of the language model, adds only ${\sim}1.1$M parameters, & generalises across ProteinShake tasks, achieving $F_{\max}$ of $0.733$ on Gene Ontology prediction & AUROC of $0.969$ on binding-site detection. We position this as an interpretable structural companion to protein language models that makes their predictions more transparent & auditable.

## 1 Introduction

Despite the explosion of sequence databases, a large fraction of sequenced proteins remain functionally unannotated (Kustatscher et al., [2022](https://arxiv.org/html/2605.10985#bib.bib6)). Large-scale structure prediction, notably AlphaFold 2 (Jumper et al., [2021](https://arxiv.org/html/2605.10985#bib.bib3)), now provides predicted structures for over 200 million proteins. Recently, Protein Language Models (PLMs) such as ESM-2 (Lin et al., [2023](https://arxiv.org/html/2605.10985#bib.bib8)) have transformed function prediction. ESM-2 in particular has become a *de facto* feature extractor: simple multilayer perceptrons (MLPs) on its mean-pooled embeddings already match or exceed structure-based graph neural networks (GNNs) on many tasks. This success, however, comes at the cost of *interpretability*. ESM-2 embeddings are 1280-dimensional dense vectors with no obvious mapping to specific residues, contacts, or biochemical motifs. Yet many downstream applications require interpretability, particularly deployment in clinical settings where safety & regulatory compliance are paramount. Furthermore, interpretable models allow researchers to verify that computational predictions correspond to meaningful biological mechanisms rather than spurious data correlations, & may thereby reveal novel insights.

Existing language-model probes (linear classifiers on attention heads, attention rollout, etc.) recover broad sequence patterns but rarely surface spatially localised, biochemically specific motifs. Structural methods such as conventional GNNs on protein contact graphs, on the other hand, often use fixed-radius neighbourhoods, which ties the graph topology to an empirically chosen cutoff. As a result, a catalytic residue forming a 3 Å hydrogen bond with its substrate & a surface residue 7.9 Å from a distant loop receive identical graph topology. Recent work (Wang and Oliver, [2025](https://arxiv.org/html/2605.10985#bib.bib13)) addresses this with variable-size partitions, based on a Geometric Vector Perceptron (GVP) encoder & a vector-quantized (VQ) codebook, though at the cost of greater computation & an interpretability gap. Importantly, these methods do not leverage the expressivity of recent PLM representations.

This raises the question of *when* structural reasoning adds information beyond what PLMs already capture. Our empirical answer is that the boundary is biological. For graph-level function tasks like enzyme classification (EC), ESM-2 mean-pooling is nearly sufficient & graph structure adds little insight. For residue-level structural tasks like binding-site detection, message-passing over the contact graph adds substantial information that ESM-2 alone cannot recover. The interesting consequence is interpretability: in the regime where structure matters, we want models whose structural reasoning is auditable. We therefore propose a computationally lightweight GNN that leverages semantically rich ESM-2 features while remaining structurally interpretable. To that end, we list our contributions:

1. Empirical characterisation. We map when structural reasoning helps frozen PLM features. For the graph-level EC task, ESM-2 mean-pooling is nearly sufficient ($0.910$ accuracy without vs $0.912$ with the contact graph). For residue-level binding-site detection, message-passing over the contact graph closes a $9.8$-point AUROC gap that ESM-2 alone cannot.
2. Interpretable architecture. We introduce SoftBlobGIN, which replaces BioBlobs' GVP encoder & VQ codebook with a single Gumbel-softmax assignment head (Jang et al., [2016](https://arxiv.org/html/2605.10985#bib.bib2)), producing $K$ differentiable, soft protein substructures with ${\sim}1.1$M parameters & no language-model retraining.
3. Biological validation. We quantitatively validate the resulting explanations against established enzyme biochemistry. GNNExplainer (Ying et al., [2019](https://arxiv.org/html/2605.10985#bib.bib19)) recovers catalytic-residue enrichment, active-site burial, spatial co-localisation, & tertiary-contact geometry consistent with catalytic triads. Learned blobs spontaneously separate functional sites from structural scaffold, with active-site-containing blobs carrying $1.85\times$ higher importance ($\rho=0.339$, $p=0.009$).

## 2 Related Work

#### Graph neural networks for protein structure\.

Protein structures are naturally represented as residue contact graphs where nodes correspond to amino acids & edges encode spatial proximity or geometric relations. Conventional GNNs such as Graph Convolutional Networks (GCN) (Kipf and Welling, [2016](https://arxiv.org/html/2605.10985#bib.bib4)) & Graph Attention Networks (GAT) (Veličković et al., [2017](https://arxiv.org/html/2605.10985#bib.bib12)) have therefore been widely adopted for structure-based protein learning. However, these architectures are limited in expressive power. Xu et al. ([2018a](https://arxiv.org/html/2605.10985#bib.bib17)) showed that both GCN & GAT are strictly less expressive than the Weisfeiler–Leman (WL) graph isomorphism test, motivating the introduction of Graph Isomorphism Networks (GIN), which match the WL upper bound. Subsequent work extended GIN to incorporate edge information through GINEConv (Hu et al., [2019](https://arxiv.org/html/2605.10985#bib.bib1)), enabling explicit modeling of pairwise residue geometry such as C$\alpha$–C$\alpha$ distances. These developments make GIN-style architectures a natural foundation for protein graphs, where local geometric interactions are often functionally informative.

#### Protein language models \(PLMs\)\.

In parallel, PLMs have substantially advanced function prediction by learning sequence-derived representations at scale. Models such as ESM-2 (Lin et al., [2023](https://arxiv.org/html/2605.10985#bib.bib8)) are trained on approximately 65 million protein sequences using masked language modeling & produce rich per-residue embeddings encoding evolutionary conservation, structural regularities, & functional context. In many downstream settings, frozen ESM-2 features combined with lightweight classifiers already achieve competitive or state-of-the-art performance. This has established PLMs as strong general feature extractors, but has also introduced an interpretability challenge. Because these representations are high-dimensional dense vectors, the structural or biochemical signals responsible for a prediction are not directly observable. Our work builds on this observation by treating ESM-2 as a frozen semantic encoder while introducing explicit graph-based structural reasoning.

#### Hierarchical pooling & structural abstraction\.

Beyond residue-level message passing, hierarchical pooling methods aim to learn coarse structural abstractions by grouping residues into higher-level substructures. DiffPool (Ying et al., [2018](https://arxiv.org/html/2605.10985#bib.bib18)) introduced differentiable soft cluster assignments for hierarchical graph coarsening, enabling end-to-end learning of graph hierarchies. More recently, BioBlobs (Wang and Oliver, [2025](https://arxiv.org/html/2605.10985#bib.bib13)) adapted this paradigm to proteins through biologically motivated blob partitions, demonstrating strong performance on ProteinShake benchmarks. However, this approach relies on a substantially heavier architecture involving Geometric Vector Perceptrons & vector quantization. In contrast, our SoftBlobGIN replaces these components with lightweight Gumbel-softmax pooling (Jang et al., [2016](https://arxiv.org/html/2605.10985#bib.bib2)), allowing differentiable learning of soft functional substructures with substantially lower computational overhead.

#### Explainability for graph neural networks\.

Interpreting GNN predictions has motivated a growing body of post hoc explanation methods. GNNExplainer (Ying et al., [2019](https://arxiv.org/html/2605.10985#bib.bib19)) learns continuous edge & feature masks by maximizing mutual information between selected subgraphs & model predictions, providing instance-specific substructure explanations. Integrated Gradients (Sundararajan et al., [2017](https://arxiv.org/html/2605.10985#bib.bib10)) offers a complementary attribution framework by integrating gradients along a path from a baseline input to the observed example. These methods have become standard tools for probing graph models, with evaluation typically based on metrics such as fidelity, sparsity, & characterization (Pope et al., [2019](https://arxiv.org/html/2605.10985#bib.bib9); Yuan et al., [2022](https://arxiv.org/html/2605.10985#bib.bib20)). In this work, we employ both methods to evaluate whether structure-aware reasoning over frozen PLM embeddings produces biologically meaningful & auditable explanations.

## 3 Problem & Evaluation Criteria

We aim to learn a function classifier $f_{\theta}$ over protein contact graphs that produces sparse, biologically faithful explanations for its predictions. An explanation is a pair of continuous masks $(M,F)$ over edges & node features. We evaluate explanations along two axes:

#### Predictive faithfulness\.

Standard fidelity\-based metrics applied to graph explanations: sparsity, sufficiency \(Fid\+\), necessity \(Fid\-\), & intra\-class feature\-mask stability \(definitions in Appendix[C](https://arxiv.org/html/2605.10985#A3)\)\.

#### Biological faithfulness\.

Predictive metrics are necessary but insufficient: an explanation can be faithful to the model & still biologically meaningless\. We additionally require explanations to align with established enzyme biochemistry along four axes \(B1\)\-\(B4\):

#### \(B1\) Catalytic\-residue enrichment\.

For each EC class $c$ & amino acid $a\in\Sigma$, let $\hat{p}_{c,a}^{\mathrm{top}}$ & $\hat{p}_{c,a}^{\mathrm{bg}}$ be the empirical frequencies of $a$ in $\mathcal{I}_{0.20}(G)$ & $V$, respectively. The log-enrichment is

$$\mathrm{Enr}(c,a)\;=\;\log_{2}\frac{\hat{p}_{c,a}^{\mathrm{top}}+\delta}{\hat{p}_{c,a}^{\mathrm{bg}}+\delta},\qquad\delta=10^{-6}.\tag{1}$$

We test the hypothesis $\mathbb{E}_{a}\big[\mathrm{Enr}(c,a)\mid a\in\mathrm{Cat}\big]>\mathbb{E}_{a}\big[\mathrm{Enr}(c,a)\mid a\notin\mathrm{Cat}\big]$, where $\mathrm{Cat}=\{\mathrm{H,C,S,D,E,K,R,Y}\}$.
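
The enrichment test in (B1) can be illustrated with a minimal sketch. It assumes residues are given as one-letter strings; the toy sequence and the choice of "top" residues below are invented for illustration and are not taken from the paper's data. Only amino acids present in the background are scored, which is a simplification of iterating over the full alphabet $\Sigma$.

```python
import math
from collections import Counter

CATALYTIC = set("HCSDEKRY")  # the Cat set from the text
DELTA = 1e-6                 # pseudo-count delta from Eq. (1)

def log_enrichment(top_residues, background_residues):
    """Return {amino acid: log2 enrichment} of top vs background frequencies."""
    top, bg = Counter(top_residues), Counter(background_residues)
    n_top, n_bg = len(top_residues), len(background_residues)
    return {
        a: math.log2((top[a] / n_top + DELTA) / (bg[a] / n_bg + DELTA))
        for a in bg  # simplification: only residue types seen in the background
    }

def catalytic_gap(enrichment):
    """Mean enrichment of catalytic residue types minus mean of the rest."""
    cat = [v for a, v in enrichment.items() if a in CATALYTIC]
    non = [v for a, v in enrichment.items() if a not in CATALYTIC]
    return sum(cat) / len(cat) - sum(non) / len(non)

# Hypothetical toy sequence; the "top" set is an invented importance selection.
background = "HDSAAGGLLVVHDSAAGGLLVV"
top = "HDSHDSA"
enr = log_enrichment(top, background)
gap = catalytic_gap(enr)
```

A positive `gap` corresponds to the hypothesised direction: catalytic residue types are over-represented among the important residues relative to the background.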

#### \(B2\) Active\-site burial\.

The expected solvent-accessible surface area (SASA) gap between important & unimportant residues should be negative, i.e. important residues should be more buried:

$$\Delta_{\mathrm{SASA}}(c)\;=\;\mathbb{E}_{i\in\mathcal{I}_{0.20}(G)}\!\big[\mathrm{SASA}_{i}\big]\;-\;\mathbb{E}_{i\notin\mathcal{I}_{0.20}(G)}\!\big[\mathrm{SASA}_{i}\big]\;<\;0.\tag{2}$$

#### \(B3\) Spatial co\-localisation\.

Let $\bar{D}(\mathcal{I})=\binom{|\mathcal{I}|}{2}^{-1}\sum_{i<j\in\mathcal{I}}\|c_{i}-c_{j}\|_{2}$ be the mean pairwise distance. Sampling $B=100$ random subsets $\mathcal{R}^{(b)}\subset V$ with $|\mathcal{R}^{(b)}|=|\mathcal{I}_{0.20}(G)|$, the spatial-clustering $z$-score is

$$Z_{\mathrm{spatial}}(G)\;=\;\frac{\bar{D}\big(\mathcal{I}_{0.20}(G)\big)-\mu_{R}}{\sigma_{R}},\quad\mu_{R}=\tfrac{1}{B}\sum_{b}\bar{D}(\mathcal{R}^{(b)}),\;\sigma_{R}=\mathrm{std}_{b}\!\big[\bar{D}(\mathcal{R}^{(b)})\big].\tag{3}$$

We require $\mathbb{E}_{G}[Z_{\mathrm{spatial}}(G)]<0$ (more compact than random).
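
As an illustration of Eq. (3), the sketch below computes the spatial-clustering $z$-score for a single protein. It assumes coordinates are plain 3-tuples, uses the sample standard deviation (the text does not say whether the population or sample estimator is used), and runs on invented coordinates rather than real structures.

```python
import math
import random
import statistics
from itertools import combinations

def mean_pairwise_distance(coords, idx):
    """Mean Euclidean distance over all unordered pairs of residues in idx."""
    pairs = list(combinations(idx, 2))
    return sum(math.dist(coords[i], coords[j]) for i, j in pairs) / len(pairs)

def spatial_z(coords, important, n_samples=100, seed=0):
    """Z-score of the important set's compactness vs random same-size subsets."""
    rng = random.Random(seed)
    observed = mean_pairwise_distance(coords, important)
    null = [
        mean_pairwise_distance(
            coords, rng.sample(range(len(coords)), len(important))
        )
        for _ in range(n_samples)
    ]
    # Assumption: sample standard deviation for sigma_R.
    return (observed - statistics.mean(null)) / statistics.stdev(null)

# Hypothetical coordinates: a tight cluster of 5 residues plus 45 spread on a line.
coords = [(0.1 * i, 0.0, 0.0) for i in range(5)]
coords += [(10.0 * i, 0.0, 0.0) for i in range(1, 46)]
z = spatial_z(coords, important=[0, 1, 2, 3, 4])
```

For the clustered toy set the score is negative, matching the "more compact than random" requirement.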

#### \(B4\) Tertiary\-contact preference\.

Among edges with $M_{e}\geq 0.5$, the fraction with sequence separation $|i-j|>20$ should exceed that of the unimportant set, & the C$\alpha$–C$\alpha$ distance distribution should peak in the catalytic-triad regime $d_{ij}\in[6,10]$ Å.
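
The long-range part of criterion (B4) can be sketched as follows. The edge list and mask values are invented; the thresholds ($M_e\geq 0.5$, $|i-j|>20$) are the ones stated above, and the distance-distribution check is omitted.

```python
def long_range_fraction(edges, min_sep=20):
    """Fraction of edges (i, j) whose sequence separation |i - j| exceeds min_sep."""
    return sum(abs(i - j) > min_sep for i, j in edges) / len(edges)

def split_by_mask(edges, mask, threshold=0.5):
    """Split edges into important (mask >= threshold) and unimportant sets."""
    important = [e for e, m in zip(edges, mask) if m >= threshold]
    unimportant = [e for e, m in zip(edges, mask) if m < threshold]
    return important, unimportant

# Hypothetical edges (residue index pairs) and invented edge-mask values.
edges = [(3, 4), (10, 57), (12, 98), (20, 21), (30, 31), (40, 120)]
mask = [0.1, 0.9, 0.8, 0.2, 0.6, 0.7]
imp, unimp = split_by_mask(edges, mask)
frac_imp = long_range_fraction(imp)
frac_unimp = long_range_fraction(unimp)
```

Here three of the four important edges are long-range, while neither unimportant edge is, so the toy explanation satisfies the tertiary-contact preference.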

The remainder of the paper presents architectures (Section [4](https://arxiv.org/html/2605.10985#S4)) that parameterise $f_{\theta}$ at varying levels of structural expressivity, results on the $\mathcal{L}_{\mathrm{cls}}$-objective (Section [5](https://arxiv.org/html/2605.10985#S5)), & joint validation against both fidelity (Section [6](https://arxiv.org/html/2605.10985#S6)) & the biological criteria (B1)–(B4) which we report in Section [6.2](https://arxiv.org/html/2605.10985#S6.SS2), all with the goal of obtaining a model that is simultaneously *accurate* & *auditable*.

## 4 Method

[Figure 1 diagram. Panels: (a) PLM-to-Graph Projection, (b) Differentiable Blob Partitioning, (c) Interpretable Structural Readout. A protein sequence $S\in\Sigma^{N}$ is embedded by the ESM-2 PLM; 3D coordinates $C\in\mathbb{R}^{N\times 3}$ define a radius graph ($\varepsilon=8$ Å); 1318-d node features pass through the GINEConv backbone, Gumbel-softmax assignment, learned blob partitions, & the dual graph embedding, followed by the explanation objective.]

Figure 1: Overview of the SoftBlobGIN Framework. Our pipeline acts as an interpretable structural companion to protein language models. (a) Dense, opaque ESM-2 representations ($\phi^{\mathrm{esm}}$) are concatenated with explicit structural/physicochemical features and projected onto a 3D contact graph. (b) A lightweight, differentiable Gumbel-Softmax (GS) pooling head learns to softly partition residues into functional substructures (blobs). (c) The resulting dual-readout graph embedding enables high-accuracy EC classification while seamlessly supporting post-hoc attribution methods to extract biologically faithful motifs, such as catalytic triads and buried active sites.

### 4.1 Architecture

Our model has three components: (i) a GIN backbone that performs message-passing over the protein contact graph, (ii) a Gumbel-softmax blob-pooling head that partitions residues into $K$ learned substructures, & (iii) a classifier that reads out from both blob-level & global representations. We describe each below; full hyperparameters & baseline architectures are in Appendix [D](https://arxiv.org/html/2605.10985#A4).

#### GIN backbone\.

Each residue's 1318-d feature vector (Appendix [B](https://arxiv.org/html/2605.10985#A2)) is projected to a hidden dimension $\hbar=256$ via a linear layer. We then apply $L=4$ GINEConv layers (Hu et al., [2019](https://arxiv.org/html/2605.10985#bib.bib1)), each parameterised by a 2-layer MLP with BatchNorm & ReLU. Edge features ($d_{e}=18$, encoding radial-basis-expanded C$\alpha$ distances & sequence separation; see Appendix [B](https://arxiv.org/html/2605.10985#A2)) are injected at every layer. After $L$ layers, each residue carries a $\hbar$-dimensional representation $h_{i}^{(L)}$ that integrates information from its local structural neighbourhood.
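
To make the edge-aware message passing concrete, the sketch below implements the aggregation step of a GINE-style layer as it is commonly stated, $(1+\epsilon)\,h_i+\sum_{j\in\mathcal{N}(i)}\mathrm{ReLU}(h_j+e_{ji})$, in plain Python. This is an assumption about the standard GINEConv formulation, not a transcription of the authors' code; the toy graph, the feature values, and the premise that edge features are already projected to the node dimension are all hypothetical. The 2-layer MLP that follows the aggregation is omitted.

```python
def relu(v):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, x) for x in v]

def gine_aggregate(h, edges, edge_attr, eps=0.0):
    """Pre-MLP GINE aggregation: (1 + eps) * h_i + sum_j ReLU(h_j + e_ji).

    h: list of node vectors; edges: directed (j, i) pairs (j sends to i);
    edge_attr: one vector per edge, assumed already projected to node dimension.
    """
    out = [[(1.0 + eps) * x for x in hi] for hi in h]
    for (j, i), e in zip(edges, edge_attr):
        msg = relu([a + b for a, b in zip(h[j], e)])
        out[i] = [a + b for a, b in zip(out[i], msg)]
    return out

# Hypothetical 3-residue graph with 2-d features.
h = [[1.0, -1.0], [0.5, 2.0], [-3.0, 0.0]]
edges = [(1, 0), (2, 0), (0, 1)]
edge_attr = [[0.5, 0.0], [1.0, 1.0], [0.0, 0.0]]
agg = gine_aggregate(h, edges, edge_attr)
```

Residue 2 receives no messages in this toy graph, so its aggregated vector equals its input; residues 0 and 1 mix in their neighbours' states shifted by the edge features.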

#### Differentiable blob pooling \(SoftBlobGIN\)\.

To obtain an interpretable graph-level representation, we partition residues into $K=8$ soft, learned substructures via Gumbel-softmax assignment (Jang et al., [2016](https://arxiv.org/html/2605.10985#bib.bib2)). An MLP head $f_{\phi}:\mathbb{R}^{\hbar}\to\mathbb{R}^{K}$ produces assignment logits $L_{ik}$ for each residue $i$ & blob $k$. With Gumbel noise $g_{ik}\stackrel{\text{iid}}{\sim}\mathrm{Gumbel}(0,1)$ & temperature $\tau_{t}$ annealed linearly from $1.0$ to $0.1$ across training:

$$A_{ik}(\tau_{t})\;=\;\frac{\exp\big((L_{ik}+g_{ik})/\tau_{t}\big)}{\sum_{k^{\prime}=1}^{K}\exp\big((L_{ik^{\prime}}+g_{ik^{\prime}})/\tau_{t}\big)},\qquad\sum_{k}A_{ik}=1.\tag{4}$$

Each blob embedding is the assignment-weighted mean of node representations, refined by a 2-layer MLP $g_{\psi}$ with LayerNorm:

$$b_{k}\;=\;g_{\psi}\!\Bigg(\mathrm{LN}\!\bigg(\frac{\sum_{i=1}^{N}A_{ik}\,h_{i}^{(L)}}{\sum_{i=1}^{N}A_{ik}+\epsilon}\bigg)\Bigg)\in\mathbb{R}^{\hbar},\quad k=1,\dots,K.\tag{5}$$

This replaces BioBlobs' GVP encoder & VQ codebook (Wang and Oliver, [2025](https://arxiv.org/html/2605.10985#bib.bib13)) with a single MLP assignment head, reducing the pooling module to ${\sim}35$K parameters while retaining differentiable, non-overlapping substructure discovery.
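
Equations (4) and (5) can be traced with a small standard-library sketch. It covers the linear temperature schedule, one relaxed Gumbel-softmax sample per residue, and the assignment-weighted blob means; the LayerNorm and the MLP $g_{\psi}$ are left out. The logits, node states, step counts, and the clamp on the uniform sample are illustrative assumptions.

```python
import math
import random

def temperature(step, total_steps, tau_start=1.0, tau_end=0.1):
    """Linear annealing of the Gumbel-softmax temperature over training steps."""
    frac = step / max(1, total_steps - 1)
    return tau_start + frac * (tau_end - tau_start)

def gumbel_softmax(logits, tau, rng):
    """One relaxed sample per row: softmax((logits + Gumbel noise) / tau)."""
    out = []
    for row in logits:
        noisy = []
        for l in row:
            u = max(rng.random(), 1e-12)           # guard against log(0)
            g = -math.log(-math.log(u))            # Gumbel(0, 1) sample
            noisy.append((l + g) / tau)
        m = max(noisy)                             # subtract max for stability
        exps = [math.exp(v - m) for v in noisy]
        s = sum(exps)
        out.append([e / s for e in exps])
    return out

def blob_means(assign, h, eps=1e-8):
    """Assignment-weighted mean of node vectors per blob (before LN and MLP)."""
    n_blobs, dim = len(assign[0]), len(h[0])
    blobs = []
    for k in range(n_blobs):
        weight = sum(a[k] for a in assign)
        blobs.append([
            sum(a[k] * hi[d] for a, hi in zip(assign, h)) / (weight + eps)
            for d in range(dim)
        ])
    return blobs

rng = random.Random(0)
logits = [[4.0, 0.0], [3.5, 0.0], [0.0, 5.0]]   # hypothetical logits, N=3, K=2
h = [[1.0, 0.0], [1.0, 0.2], [0.0, 1.0]]        # hypothetical node states
A = gumbel_softmax(logits, temperature(99, 100), rng)
blobs = blob_means(A, h)
```

At low temperature the rows of `A` approach one-hot vectors, which is what makes the soft assignments behave like a near-hard partition late in training.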

#### Readout & classifier\.

The graph embedding concatenates a blob-level max-pool with a global mean-pool:

$$z(G)\;=\;\big[\,\max_{k\in[K]}b_{k}\;\|\;\tfrac{1}{N}\textstyle\sum_{i}h_{i}^{(L)}\,\big]\;\in\mathbb{R}^{2\hbar}.\tag{6}$$

A 2-layer MLP with BatchNorm maps $z(G)$ to class logits. The full model has ${\sim}1.1$M trainable parameters; Algorithm [1](https://arxiv.org/html/2605.10985#alg1) (Appendix [A](https://arxiv.org/html/2605.10985#A1)) gives the complete forward pass.
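
A minimal sketch of the dual readout in Eq. (6), reading the max over blobs as an element-wise maximum (which is what the stated output dimension $2\hbar$ implies). The blob and node vectors are invented toy values.

```python
def dual_readout(blobs, h):
    """Concatenate the element-wise max over blobs with the mean over residues."""
    dim = len(h[0])
    blob_max = [max(b[d] for b in blobs) for d in range(dim)]
    node_mean = [sum(hi[d] for hi in h) / len(h) for d in range(dim)]
    return blob_max + node_mean

blobs = [[0.2, 0.9], [0.7, 0.1]]          # hypothetical K=2 blob embeddings
h = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # hypothetical N=3 node states
z = dual_readout(blobs, h)
```

The first half of `z` reflects the strongest blob response per feature, while the second half keeps a global summary that does not depend on the partition.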

#### Baselines\.

To isolate the contribution of graph structure, we compare against two non-graph baselines (Seq MLP on amino-acid composition; Residue MLP on mean-pooled 1318-d features) & a plain GIN without blob pooling, which instead uses JumpingKnowledge concatenation (Xu et al., [2018b](https://arxiv.org/html/2605.10985#bib.bib16)) with dual mean–max readout (${\sim}1.4$M parameters). All share the same feature pipeline. Details are in Appendix [D](https://arxiv.org/html/2605.10985#A4).

#### Task\-specific heads\.

For graph-level tasks (EC classification, Gene Ontology, Protein Family), we use SoftBlobGIN with the blob-pooling readout above. For node-level tasks (binding-site detection), blob pooling is inapplicable because it produces graph-level rather than residue-level representations; we therefore use the GIN backbone directly with a per-residue MLP head on the JumpingKnowledge output. We note this distinction explicitly: results on node-level tasks reflect the GIN backbone's message-passing, not the blob-pooling module. For pairwise tasks (structure similarity, PPI), we use Siamese or bilinear heads on the SoftBlobGIN encoder (Appendix [D](https://arxiv.org/html/2605.10985#A4)).

### 4.2 Explainability & training

We apply the explanation framework to the trained GIN classifier using both GNNExplainer & Integrated Gradients (IG). For GNNExplainer, we optimise continuous edge & feature masks

$$(M^{*},F^{*})=\arg\min_{M,F}-\log\big[f_{\theta}(G_{M})\big]_{\hat{y}}+\lambda_{1}\frac{\|M\|_{1}}{|E|}+\lambda_{2}\mathcal{H}(M)+\lambda_{3}\frac{\|F\|_{1}}{d}+\lambda_{4}\mathcal{H}(F),\tag{7}$$

where $G_{M}$ denotes the masked graph & $\mathcal{H}(\cdot)$ is the element-wise entropy regulariser. Integrated Gradients computes attribution scores by accumulating gradients along a linear path from baseline input $x^{\prime}$ to input $x$:

$$\mathrm{IG}_{i}(x)=(x_{i}-x_{i}^{\prime})\int_{\alpha=0}^{1}\frac{\partial f(x^{\prime}+\alpha(x-x^{\prime}))}{\partial x_{i}}\,d\alpha,\tag{8}$$

approximated numerically with a 50-step trapezoidal rule. Per-protein GNNExplainer optimisation runs for 300 gradient steps, after which we evaluate explanations using fidelity, stability, unfaithfulness, & characterization metrics. Stability quantifies robustness to perturbations.
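
The trapezoidal approximation of Eq. (8) can be sketched on a toy function with a known gradient. The sketch assumes "50-step" means 50 intervals (51 gradient evaluations); the function `f`, its gradient, and the zero baseline are hypothetical stand-ins for the trained classifier.

```python
def integrated_gradients(grad_f, x, baseline, steps=50):
    """Trapezoidal approximation of IG along the straight path baseline -> x."""
    total = [0.0] * len(x)
    for s in range(steps + 1):
        alpha = s / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        weight = 0.5 if s in (0, steps) else 1.0   # trapezoid end-point weights
        g = grad_f(point)
        total = [t + weight * gi for t, gi in zip(total, g)]
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, total)]

# Hypothetical toy model f(x) = x0^2 + 3*x1 with an analytic gradient.
def f(x):
    return x[0] ** 2 + 3.0 * x[1]

def grad_f(x):
    return [2.0 * x[0], 3.0]

x, baseline = [2.0, 1.0], [0.0, 0.0]
ig = integrated_gradients(grad_f, x, baseline)
```

For this toy function the attributions sum to $f(x)-f(x^{\prime})$, the completeness property that makes IG scores directly comparable across features.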

#### Class\-imbalanced training\.

We minimise $\mathcal{L}_{\mathrm{cls}}$ (Eq. [9](https://arxiv.org/html/2605.10985#S4.E9)) with AdamW (lr $=10^{-3}$, weight decay $=10^{-4}$), cosine annealing with 10-epoch warmup, gradient clipping at norm 1.0, edge dropout (5%) & feature masking (5%) during training, up to 200 epochs with patience 30 on val loss. For the ensemble, five SoftBlobGIN models with different seeds are averaged at the softmax level. Because the class frequencies $\pi_{c}=\mathbb{P}(Y=c)$ are highly skewed ($\pi_{3}/\pi_{7}\approx 40$), we minimise the focal loss (Lin et al., [2017](https://arxiv.org/html/2605.10985#bib.bib7)) with label smoothing $\eta=0.05$:

$$\mathcal{L}_{\mathrm{cls}}(\theta)\;=\;\mathbb{E}_{(G,y)\sim\mathcal{D}_{\mathrm{train}}}\!\Big[\,-\sum_{c\in\mathcal{C}}\alpha_{c}\,\big(1-p_{c}\big)^{\gamma}\,\widetilde{q}_{c}\,\log p_{c}\Big],\tag{9}$$

where $p_{c}=[f_{\theta}(G)]_{c}$, $\alpha_{c}\propto 1/\pi_{c}$ are inverse-frequency weights, $\gamma=1$, and the smoothed target is $\widetilde{q}_{c}=(1-\eta)\,\mathbbm{1}[c=y]+\tfrac{\eta}{|\mathcal{C}|-1}\mathbbm{1}[c\neq y]$. At test time we report accuracy, macro-F1 & macro-AUROC.
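
A per-sample version of Eq. (9) is sketched below, with $\gamma=1$ and $\eta=0.05$ as stated. The normalisation of the inverse-frequency weights to mean 1 is an assumption (the text only gives proportionality), and the two-class probabilities are invented; the class counts are the EC 3 and EC 7 training counts reported in Section 5.1.

```python
import math

def smoothed_target(y, n_classes, eta=0.05):
    """(1 - eta) on the true class, eta / (C - 1) on every other class."""
    return [1.0 - eta if c == y else eta / (n_classes - 1)
            for c in range(n_classes)]

def focal_loss(probs, y, alpha, gamma=1.0, eta=0.05):
    """Per-sample class-weighted focal loss with label smoothing (Eq. 9)."""
    q = smoothed_target(y, len(probs), eta)
    return -sum(
        a * (1.0 - p) ** gamma * qc * math.log(p)
        for a, p, qc in zip(alpha, probs, q)
    )

def inverse_frequency_weights(counts):
    """alpha_c proportional to 1 / pi_c; scaling to mean 1 is an assumption."""
    inv = [1.0 / c for c in counts]
    scale = len(inv) / sum(inv)
    return [w * scale for w in inv]

counts = [5619, 139]                    # EC 3 and EC 7 training counts
alpha = inverse_frequency_weights(counts)
confident = focal_loss([0.9, 0.1], y=0, alpha=[1.0, 1.0])
uncertain = focal_loss([0.6, 0.4], y=0, alpha=[1.0, 1.0])
```

The rare class receives roughly 40 times the weight of the common one, and a confident correct prediction is penalised less than an uncertain one, which is the combined effect the loss is meant to have.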

## 5 Experiments

In this section, we evaluate our method against structurally less expressive baselines (Section [5.2](https://arxiv.org/html/2605.10985#S5.SS2)), against external baseline GNNs (Section [5.3](https://arxiv.org/html/2605.10985#S5.SS3)), & across multiple tasks (Section [5.4](https://arxiv.org/html/2605.10985#S5.SS4)). Lastly, we provide extensive ablation of each component of our method.

### 5.1 Dataset & Metrics

We use the ProteinShake (Kucera et al., [2023](https://arxiv.org/html/2605.10985#bib.bib5)) EC dataset of 15,603 PDB structures annotated with first-level EC numbers. Nodes are residues & edges connect C$\alpha$ atoms within $\varepsilon=8$ Å. We adopt ProteinShake's random split with a strict sequence-similarity threshold $\theta=0.7$, yielding 14,042 train / 780 val / 781 test proteins. The dataset is severely imbalanced: EC 3 has 5,619 training samples, EC 7 only 139 (a $40{:}1$ ratio). We report accuracy, macro-F1, & macro-AUROC; macro-F1 is our primary metric because it weights rare classes equally. Experiments run on a single NVIDIA RTX 6000 Ada (48 GB).
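
The contact-graph construction described above can be sketched in a few lines. This is a quadratic-time illustration on invented coordinates, treating "within $\varepsilon$" as an inclusive cutoff (an assumption); a real pipeline would typically use a spatial index or a library routine instead.

```python
import math

def radius_graph(coords, cutoff=8.0):
    """Undirected edges between residues whose C-alpha atoms lie within cutoff (Å)."""
    n = len(coords)
    return [
        (i, j)
        for i in range(n)
        for j in range(i + 1, n)
        if math.dist(coords[i], coords[j]) <= cutoff   # inclusive: an assumption
    ]

# Hypothetical C-alpha coordinates spaced 3.8 Å apart along a straight line.
coords = [(3.8 * i, 0.0, 0.0) for i in range(5)]
edges = radius_graph(coords)
```

With 3.8 Å spacing, residues one and two positions apart (3.8 Å and 7.6 Å) are connected, while those three positions apart (11.4 Å) are not.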

### 5.2 Primary Results

Table [1](https://arxiv.org/html/2605.10985#S5.T1.fig1) compares SoftBlobGIN against three baselines of increasing structural expressivity (defined in Appendix [D](https://arxiv.org/html/2605.10985#A4)). The hierarchy reveals a clear pattern: the largest performance jump comes from adding ESM-2 features (Seq MLP $\to$ Residue MLP: $+0.199$ accuracy, $+0.270$ macro-F1), not from adding graph structure (Residue MLP $\to$ GIN: $+0.015$ accuracy, $+0.002$ macro-F1). This confirms that frozen PLM embeddings are the dominant signal for graph-level EC classification. Nonetheless, graph structure is not redundant. GIN achieves the highest single-model accuracy (0.925), & SoftBlobGIN achieves the highest single-model macro-F1 (0.876), a metric that weights rare classes equally & is therefore more informative given the 40:1 class imbalance. The 5-seed ensemble stabilises minority-class variance (Appendix [E](https://arxiv.org/html/2605.10985#A5)) & reaches 0.928 accuracy & 0.898 macro-F1. Crucially, SoftBlobGIN's competitive performance comes *with* built-in blob-level interpretability (Section [6.3](https://arxiv.org/html/2605.10985#S6.SS3)), whereas GIN's marginally higher accuracy provides no such structural decomposition.

| Model | Acc | Mac. F1 | Mac. AUROC | Params | Time |
| --- | --- | --- | --- | --- | --- |
| Seq MLP | 0.711 | 0.597 | 0.912 | 140K | 54 s |
| Residue MLP | 0.910 | 0.867 | 0.964 | 1.2M | 32 s |
| GIN | 0.925 | 0.869 | 0.969 | 1.4M | 15 m |
| SoftBlobGIN | 0.912 | 0.876 | 0.951 | 1.1M | 27 m |
| Ens. (5$\times$SoftBlobGIN) | 0.928 | 0.898 | 0.955 | – | – |

Table 1: EC classification on the ProteinShake test set (781 proteins, 7 classes). Baselines are defined in Appendix [D](https://arxiv.org/html/2605.10985#A4). Macro-F1 is the primary metric because it gives equal weight to each class despite the 40:1 imbalance. The ensemble averages softmax outputs across five random seeds.
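
The preference for macro-F1 under class imbalance can be seen on a small invented example: a classifier that always predicts the majority class scores well on accuracy but poorly on macro-F1. The labels below are hypothetical and unrelated to the ProteinShake data.

```python
def macro_f1(y_true, y_pred, n_classes):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for c in range(n_classes):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / n_classes

# Hypothetical imbalanced toy labels: 8 majority-class samples, 2 minority.
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 10            # always predicts the majority class
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred, n_classes=2)
```

Accuracy here is 0.8 while macro-F1 is about 0.44, because the minority class contributes an F1 of zero with the same weight as the majority class.
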
### 5.3 Comparison with external baselines

We compare against two external structure-based baselines, GearNet (Zhang et al., [2022](https://arxiv.org/html/2605.10985#bib.bib15)) & ProNet (Wang et al., [2022](https://arxiv.org/html/2605.10985#bib.bib14)), as well as a frozen ESM-2 mean-pooled linear probe, all evaluated on identical splits (Table [2](https://arxiv.org/html/2605.10985#S5.T2.fig1)). The ESM-2 linear probe is a remarkably strong baseline: $0.841$ accuracy & $0.787$ macro-F1 with only 9K trainable parameters, confirming that frozen PLM features already capture most of the EC signal. GearNet & ProNet, despite being designed for structure-based protein learning (16.3M & 0.73M parameters, respectively), perform well below the linear probe on this split. Neither architecture uses ESM-2 embeddings, relying instead on coordinate-derived features alone; their low scores reflect the difficulty of learning EC-discriminative representations from structure without the evolutionary context that PLMs provide. The linear probe thus marks what a linear readout of frozen ESM-2 features achieves without structural reasoning (${\sim}0.84$ accuracy). SoftBlobGIN improves on this level by $+7.1$ accuracy points & $+8.9$ macro-F1 points, & this gain comes *with* the interpretability machinery of Section [6](https://arxiv.org/html/2605.10985#S6).

| Model | Accuracy | Macro F1 | Params | Time |
| --- | --- | --- | --- | --- |
| GearNet \(Zhang et al\., [2022](https://arxiv.org/html/2605.10985#bib.bib15)\) | 0\.539 | 0\.190 | 16\.3M | 63 m |
| ProNet \(Wang et al\., [2022](https://arxiv.org/html/2605.10985#bib.bib14)\) | 0\.558 | 0\.439 | 0\.73M | 21 m |
| ESM\-2 linear probe \(mean\-pool\) | 0\.841 | 0\.787 | 9\.0K | 30 m |
| SoftBlobGIN | 0\.912 | 0\.876 | 1\.1M | 27 m |
| Ensemble \(5$\times$ SoftBlobGIN\) | 0\.928 | 0\.898 | – | – |

Table 2: External baseline comparison on ProteinShake EC using identical splits\. GearNet & ProNet use coordinate\-only features, whereas the linear probe & our models use frozen ESM\-2 embeddings\.
### 5\.4ProteinShake benchmark sweep

We apply the same architecture \(with task\-specific output heads: linear classifier, multi\-label sigmoid, scalar regression, node binary classifier, Siamese, bilinear PPI\) to all ProteinShake tasks \(Table [3](https://arxiv.org/html/2605.10985#S5.T3)\)\. The same SoftBlobGIN backbone is competitive across heterogeneous task types\. Binding\-site detection at the node level shows the largest delta over the linear probe \($+0.086$ AUROC, MCC $0.783$\), consistent with the $+9.8$\-point gap reported in Section [6\.1](https://arxiv.org/html/2605.10985#S6.SS1) & reflecting the strong inductive bias of message passing for residue\-level predictions\. Pairwise StructureSimilarity also shows a large gain \($+0.193$ Spearman\), as expected for a task that intrinsically depends on 3D structure\. ProteinFamily & StructuralClass are nearly saturated by the linear probe, indicating that ESM\-2 already encodes most of the fold\-classification signal & graph structure adds little beyond it\. LigandAffinity remains the weakest area in absolute terms, possibly reflecting that affinity depends on details beyond the residue contact graph\.

Table 3: ProteinShake benchmark \(random split\)\. Each cell reports the task’s official primary metric\. Best results are in bold; second\-best are underlined\. ‘\-’ denotes failed or inapplicable methods\. GearNet & ProNet use only coordinate features, while the linear probe & SoftBlobGIN use frozen ESM\-2 embeddings\.
### 5\.5Ablation Studies

To isolate the contribution of each feature group, we retrain SoftBlobGIN with progressively richer input representations \(Table [4](https://arxiv.org/html/2605.10985#S5.T4)\)\. One\-hot amino\-acid identity alone yields 0\.752 accuracy & 0\.624 macro\-F1\. Appending physicochemical properties, SASA/RSA, & edge features slightly *decreases* raw accuracy \(0\.722\) while improving macro\-F1 \(0\.653\), reflecting better calibration on rare classes at the expense of majority\-class accuracy\. The single largest jump occurs when frozen ESM\-2 embeddings are added: macro\-F1 rises from 0\.653 to 0\.853, accounting for approximately 85% of the total improvement\. The full feature set provides the highest AUROC \(0\.978\), though macro\-F1 dips marginally to 0\.846, suggesting mild redundancy between ESM\-2 & handcrafted features that the model resolves differently across metrics\. Additional hyperparameter sweeps over contact radius \($\varepsilon \in \{4, 8, 12\}$ Å, optimal at $8$ Å\) & blob count \($K \in \{3, 5, 8, 12\}$, optimal at $K = 8$\) are reported in Appendix [E](https://arxiv.org/html/2605.10985#A5)\. The 5\-seed ensemble adds $+0.040$ macro\-F1 over the mean individual model\.

Table 4: Feature\-set ablation on SoftBlobGIN \(EC test set\)\. Each row cumulatively adds one feature group to the row above\. Dim is the resulting node\-feature dimensionality\. The ESM\-2 row produces the single largest gain \($+0.200$ macro\-F1\), confirming that frozen PLM embeddings are the dominant contributor to classification performance\.

## 6Interpretable Structural Explanations

### 6\.1When does structure help? Binding\-site as a clean test

The interpretability claim is only meaningful if a structurally aware GNN *adds* something ESM\-2 alone does not provide\. We test this on ProteinShake binding\-site detection \($n = 465$ test proteins with per\-residue ground\-truth labels\), comparing three approaches that all consume the same frozen ESM\-2 features \(Table [5](https://arxiv.org/html/2605.10985#S6.T5.8)\): unsupervised ESM\-2 attention; a per\-residue linear probe on ESM\-2; & the GIN backbone with message\-passing over the contact graph\. Blob pooling is bypassed for this node\-level task \(Section [4](https://arxiv.org/html/2605.10985#S4)\); the result reflects the GIN backbone alone\.

| Method | Pred AUROC | Top\-10% Prec\. |
| --- | --- | --- |
| ESM\-2 attention \(unsupervised\) | $0.634 \pm 0.095$ | $0.331 \pm 0.187$ |
| Linear probe on ESM\-2 | $0.885 \pm 0.119$ | $0.758 \pm 0.240$ |
| GIN backbone \(ESM\-2 \+ graph\) | $\mathbf{0.983 \pm 0.040}$ | $\mathbf{0.931 \pm 0.153}$ |

Table 5: Binding\-site prediction quality on $n = 465$ test proteins\. All methods use the same frozen ESM\-2 features & differ only in how those representations are aggregated\.

The 9\.8\-point AUROC gap between the GIN & the linear probe \($0.983$ vs $0.885$\) is the central evidence that graph structure adds binding\-site information ESM\-2 alone cannot recover\. Both methods use identical features; the delta is wholly attributable to message\-passing over the contact graph\. ESM\-2 attention is much weaker still \($0.634$\); its high entropy \($5.12$\) & low sequence bias \($0.068$\) confirm that attention is broadly diffuse rather than anchored to spatially specific binding pockets\. We use GNNExplainer & Integrated Gradients rather than vanilla gradient saliency, which is uninformative on the trained GNN due to saturation at high\-confidence residues \(Appendix [F](https://arxiv.org/html/2605.10985#A6)\)\.
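
The two per\-residue metrics in Table 5 can be sketched as follows\. This is a minimal illustration, assuming that AUROC and top\-10% precision are computed per protein from residue scores and binary binding\-site labels and then averaged across proteins; the function names and the toy scores are hypothetical\.

```python
import math


def auroc(scores, labels):
    """Rank-based AUROC (Mann-Whitney U); ties receive average ranks.

    Assumes both classes are present in `labels`.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    pos_rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def top_fraction_precision(scores, labels, frac=0.10):
    """Fraction of true binding residues among the top-`frac` scored residues."""
    k = max(1, math.ceil(frac * len(scores)))
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sum(labels[i] for i in top) / k


# Toy protein with 10 residues (assumption): 3 binding-site residues.
scores = [0.9, 0.8, 0.3, 0.2, 0.7, 0.1, 0.4, 0.05, 0.6, 0.15]
labels = [1, 1, 0, 0, 0, 0, 1, 0, 0, 0]

protein_auroc = auroc(scores, labels)
protein_prec = top_fraction_precision(scores, labels)
```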

#### When does structure not help?

As an out\-of\-distribution probe, we evaluated SoftBlobGIN on DeepLoc\-2\.1 subcellular localisation \(Thumuluri et al\., [2022](https://arxiv.org/html/2605.10985#bib.bib11)\)\. ESM\-only reaches $F_{\max} = 0.688$; SoftBlobGIN reaches $0.682$, indicating that adding the GNN does not improve over ESM\-only\. Localisation depends on signal peptides & transmembrane segments rather than residue contacts, consistent with the dichotomy that the structural companion adds value when the prediction is structurally mediated & contributes little when it is not\.

### 6\.2Biological co\-referencing of GNNExplainer outputs

![Refer to caption](https://arxiv.org/html/2605.10985v1/fig/4RSL_blobs.png)

4RSL EC 3: Blob 2 captures catalytic pocket

![Refer to caption](https://arxiv.org/html/2605.10985v1/fig/2VDG_blobs.png)

2VDG EC 6: Blob 4 localizes active\-site residues

![Refer to caption](https://arxiv.org/html/2605.10985v1/fig/6A18_blobs.png)

6A18 EC 7: Blob 1 spans transport core

Figure 2: Qualitative 3D case studies of learned SoftBlobGIN substructures\. Proteins are rendered with residues colored by blob assignment, while annotated active\-site residues are shown as magenta sticks\. Across single\-domain, multi\-domain, & translocase examples, one dominant learned blob consistently overlaps the functional active region, supporting biological interpretability of learned partitions\.

Predictive faithfulness is necessary but insufficient: explanations must also align with known enzyme biochemistry\. We evaluate the four biological\-faithfulness criteria from Section [3](https://arxiv.org/html/2605.10985#S3) on the top\-20% residues per protein ranked by GNNExplainer importance \(Table [6](https://arxiv.org/html/2605.10985#S6.T6)\)\.

- \(B1\) Top residues are enriched for known catalytic amino acids in the EC classes where these residues are catalytically established: Cys/His for oxidoreductases \(EC 1\), Ser/His/Asp for hydrolases \(EC 3\), & charged residues for translocases \(EC 7\)\.
- \(B2\) The empirical SASA gap is negative for all 7 EC classes; important residues are consistently more buried than unimportant ones, the structural signature of active\-site pockets\.
- \(B3\) The spatial $z$\-score satisfies $\mathbb{E}_{G}[Z_{\mathrm{spatial}}(G)] < 0$ across all classes, with important residues $5$–$15\%$ more compact in 3D than random subsets\.
- \(B4\) Among edges with mask values $\geq 0.5$, the C$\alpha$–C$\alpha$ distance distribution peaks in $[6, 10]$ Å, the catalytic\-triad regime \(Ser–His ${\sim}6$ Å, His–Asp ${\sim}7$ Å in serine proteases\)\.
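
Criterion \(B3\) can be made concrete with a small sketch\. The exact definition of $Z_{\mathrm{spatial}}$ is given in Section 3 and is not reproduced here; the code below assumes one plausible form, namely the mean pairwise distance of the important residues standardised against same\-size random subsets, so a negative value means "more compact than chance"\. The coordinates and function names are hypothetical\.

```python
import math
import random
import statistics


def mean_pairwise_distance(coords, idx):
    """Mean Euclidean distance over all unordered pairs of the selected residues."""
    pairs = [(a, b) for i, a in enumerate(idx) for b in idx[i + 1:]]
    return sum(math.dist(coords[a], coords[b]) for a, b in pairs) / len(pairs)


def spatial_z_score(coords, top_idx, n_random=200, seed=0):
    """Assumed form: (observed - mean of random subsets) / std of random subsets."""
    rng = random.Random(seed)
    observed = mean_pairwise_distance(coords, top_idx)
    null = [
        mean_pairwise_distance(coords, rng.sample(range(len(coords)), len(top_idx)))
        for _ in range(n_random)
    ]
    return (observed - statistics.mean(null)) / statistics.stdev(null)


# Toy protein (assumption): 50 residues on a line, 3.8 Å apart;
# the "important" residues form one contiguous pocket.
coords = [(3.8 * i, 0.0, 0.0) for i in range(50)]
top_idx = list(range(20, 30))

z = spatial_z_score(coords, top_idx)  # negative: the pocket is more compact than random subsets
```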

Table 6: Top\-ranked feature groups & enriched amino acids identified by GNNExplainer across EC classes\. Explanations recover class\-specific biochemical motifs & structural signatures\.
### 6\.3Learned blobs map to functional substructures

SoftBlobGIN’s blob assignments provide an explanation by construction, without any post\-hoc optimisation\. Across all test proteins, the model spontaneously decomposes each structure into a large, solvent\-exposed structural\-core blob \(102 to 335 residues, mean SASA $0.28$ to $0.36$\) & several small, buried functional\-site blobs \(8 to 40 residues, mean SASA $0.10$ to $0.17$\)\. This core\-versus\-functional separation is consistent across all 7 EC classes, with functional blobs 2 to 2\.6$\times$ more buried than the core blob, despite the model receiving no active\-site supervision during training \(per\-blob statistics in Appendix [G](https://arxiv.org/html/2605.10985#A7)\)\.

To test whether this decomposition captures functionally relevant substructures, we analyse blob importance scores for 59 EC test proteins with active\-site annotations obtained via the PDB $\to$ UniProt features pipeline\. We define the importance $\pi_{t}$ of blob $t$ as the fraction of readout dimensions in which blob $t$ supplies the argmax in $z_{\mathrm{blob}} = \max_{k} b_{k}$ \(Eq\. [5](https://arxiv.org/html/2605.10985#S4.E5)\), analogous to BioBlobs’ attention weight\. The blob containing the most annotated active\-site residues \(the *active blob*\) carries higher importance than competing blobs in 62\.7% of proteins, with mean importance $0.209$ vs\. $0.113$ for other blobs \($1.85\times$ enrichment\)\. On average, the active blob ranks $2.97/8$ \(top $37\%$\), and the correlation between $\pi_{t}$ rank and active\-site enrichment is statistically significant \(Spearman $\rho = 0.339$, $p = 0.009$\)\. The effect is moderate because EC classification depends on both active\-site chemistry & global fold topology, but the signal is reproducible across the test set\. Together with the burial separation above, these analyses indicate that learned blobs capture biologically meaningful substructures rather than arbitrary partitions\. The partial overlap with Pfam/CATH domain boundaries \(Jaccard $0.380$, decaying with domain count; Appendix [H](https://arxiv.org/html/2605.10985#A8)\) further suggests that blobs are optimised for the classification objective rather than recapitulating known domain segmentations\.
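
The blob importance $\pi_{t}$ defined above reduces to counting argmax "wins" per readout dimension\. The sketch below follows that definition directly; the toy embeddings \(3 blobs, 4 dimensions\) and the function name are assumptions for illustration only\.

```python
def blob_importance(blob_embeddings):
    """Return pi_t for each blob: the fraction of readout dimensions in which
    blob t supplies the maximum of z_blob = max_k b_k.

    `blob_embeddings` is a list of K refined blob vectors b_k, each of length D.
    Ties are resolved in favour of the lowest blob index (an assumption).
    """
    num_blobs = len(blob_embeddings)
    dim = len(blob_embeddings[0])
    wins = [0] * num_blobs
    for d in range(dim):
        winner = max(range(num_blobs), key=lambda k: blob_embeddings[k][d])
        wins[winner] += 1
    return [w / dim for w in wins]


# Toy blob embeddings (assumption): K = 3 blobs, D = 4 readout dimensions.
b = [
    [0.9, 0.1, 0.4, 0.2],
    [0.2, 0.8, 0.5, 0.1],
    [0.1, 0.3, 0.6, 0.7],
]
pi = blob_importance(b)  # blob 2 wins two of the four dimensions
```

Because every dimension has exactly one winner, the $\pi_{t}$ values sum to 1 across blobs, which makes the $0.209$ vs\. $0.113$ comparison a statement about shares of the max\-pooled readout\.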

## 7Discussion & Limitations

#### ESM\-2 dominates EC classification, but not binding\-site prediction\.

A simple MLP on mean\-pooled ESM\-2 nearly matches the full GIN on EC classification \(0\.867 vs 0\.869 macro\-F1\)\. For first\-level EC classification, ESM\-2 has likely seen enough evolutionary diversity during pretraining to implicitly encode the relevant motifs\. The picture is *very different* on binding\-site detection \(Section [6\.1](https://arxiv.org/html/2605.10985#S6.SS1)\): ESM\-2 attention reaches only AUROC $0.634$, an ESM\-2 linear probe reaches $0.885$, & SoftBlobGIN’s GIN backbone reaches $0.983$\. Using the same features, graph message passing alone adds $+9.8$ absolute AUROC points\. This is the cleanest evidence in the paper that structural reasoning is not redundant: it adds binding\-site\-relevant information that ESM\-2 alone cannot recover &, together with the biological\-faithfulness analyses, makes the prediction auditable\.

#### Interpretability is more than fidelity\.

Our explanations have imperfect Fid$^{-}$ \(the model is too redundant to be “broken” by edge removal\) but excellent biological alignment \(catalytic residues, burial, spatial clustering, contact geometry\)\. We argue that biological co\-referencing is the more meaningful test for protein\-function explanation, & we encourage the community to evaluate explainers against domain\-derived priors rather than only against the explainer’s own optimisation objective\.

#### Limitations\.

The model exhibits substantial redundancy, driving Fid$^{-}$ close to zero because multiple edge subsets can support the same prediction\. Gradient saturation at high\-confidence residues also limits vanilla input $\times$ gradient saliency \(Appendix [F](https://arxiv.org/html/2605.10985#A6)\), motivating our use of GNNExplainer & Integrated Gradients \(Yuan et al\., [2022](https://arxiv.org/html/2605.10985#bib.bib20)\)\. Biological validation is based on general biochemical priors rather than residue\-level catalytic annotations; incorporating curated supervision such as the Catalytic Site Atlas would enable finer\-grained evaluation\. Learned blobs capture features useful for EC classification but do not explicitly align with Pfam/CATH domains \(Jaccard $0.380$; Appendix [H](https://arxiv.org/html/2605.10985#A8)\)\. SoftBlobGIN also tends to allocate $50$–$60\%$ of residues to a dominant core blob, suggesting room for improved partition balance\.

## 8Conclusion

We presented SoftBlobGIN, a lightweight graph neural network that learns differentiable protein substructure partitions\. Across the ProteinShake benchmark, the framework achieves strong classification performance while producing structurally grounded explanations: GNNExplainer recovers catalytic residues, active\-site burial patterns, & contact geometry consistent with known enzyme biochemistry, & learned blobs spontaneously separate functional sites from structural scaffold without active\-site supervision\. The framework is plug\-&\-play \(no PLM retraining\), adds only $\sim$1\.1M parameters, & generalises across ProteinShake tasks\. We position this not as a replacement for protein language models but as an *interpretable structural companion* that makes their predictions auditable for downstream scientific & clinical use\.

## References

- W\. Hu, B\. Liu, J\. Gomes, M\. Zitnik, P\. Liang, V\. Pande, and J\. Leskovec \(2019\) Strategies for pre\-training graph neural networks\. arXiv preprint arXiv:1905\.12265\.
- E\. Jang, S\. Gu, and B\. Poole \(2016\) Categorical reparameterization with gumbel\-softmax\. arXiv preprint arXiv:1611\.01144\.
- J\. Jumper, R\. Evans, A\. Pritzel, T\. Green, M\. Figurnov, O\. Ronneberger, K\. Tunyasuvunakool, R\. Bates, A\. Žídek, A\. Potapenko, et al\. \(2021\) Highly accurate protein structure prediction with alphafold\. Nature 596\(7873\), pp\. 583–589\.
- T\. N\. Kipf and M\. Welling \(2016\) Semi\-supervised classification with graph convolutional networks\. arXiv preprint arXiv:1609\.02907\.
- T\. Kucera, C\. Oliver, D\. Chen, and K\. Borgwardt \(2023\) Proteinshake: building datasets and benchmarks for deep learning on protein structures\. Advances in Neural Information Processing Systems 36, pp\. 58277–58289\.
- G\. Kustatscher, T\. Collins, A\. Gingras, T\. Guo, H\. Hermjakob, T\. Ideker, K\. S\. Lilley, E\. Lundberg, E\. M\. Marcotte, M\. Ralser, et al\. \(2022\) Understudied proteins: opportunities and challenges for functional proteomics\. Nature Methods 19\(7\), pp\. 774–779\.
- T\. Lin, P\. Goyal, R\. Girshick, K\. He, and P\. Dollár \(2017\) Focal loss for dense object detection\. In Proceedings of the IEEE international conference on computer vision, pp\. 2980–2988\.
- Z\. Lin, H\. Akin, R\. Rao, B\. Hie, Z\. Zhu, W\. Lu, N\. Smetanin, R\. Verkuil, O\. Kabeli, Y\. Shmueli, et al\. \(2023\) Evolutionary\-scale prediction of atomic\-level protein structure with a language model\. Science 379\(6637\), pp\. 1123–1130\.
- P\. E\. Pope, S\. Kolouri, M\. Rostami, C\. E\. Martin, and H\. Hoffmann \(2019\) Explainability methods for graph convolutional neural networks\. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp\. 10772–10781\.
- M\. Sundararajan, A\. Taly, and Q\. Yan \(2017\) Axiomatic attribution for deep networks\. In International conference on machine learning, pp\. 3319–3328\.
- V\. Thumuluri, J\. J\. Almagro Armenteros, A\. R\. Johansen, H\. Nielsen, and O\. Winther \(2022\) DeepLoc 2\.0: multi\-label subcellular localization prediction using protein language models\. Nucleic acids research 50\(W1\), pp\. W228–W234\.
- P\. Veličković, G\. Cucurull, A\. Casanova, A\. Romero, P\. Lio, and Y\. Bengio \(2017\) Graph attention networks\. arXiv preprint arXiv:1710\.10903\.
- L\. Wang, H\. Liu, Y\. Liu, J\. Kurtin, and S\. Ji \(2022\) Learning hierarchical protein representations via complete 3d graph networks\. arXiv preprint arXiv:2207\.12600\.
- X\. Wang and C\. Oliver \(2025\) BioBlobs: differentiable graph partitioning for protein representation learning\. arXiv preprint arXiv:2510\.01632\.
- K\. Xu, W\. Hu, J\. Leskovec, and S\. Jegelka \(2018a\) How powerful are graph neural networks? arXiv preprint arXiv:1810\.00826\.
- K\. Xu, C\. Li, Y\. Tian, T\. Sonobe, K\. Kawarabayashi, and S\. Jegelka \(2018b\) Representation learning on graphs with jumping knowledge networks\. In International conference on machine learning, pp\. 5453–5462\.
- Z\. Ying, D\. Bourgeois, J\. You, M\. Zitnik, and J\. Leskovec \(2019\) Gnnexplainer: generating explanations for graph neural networks\. Advances in neural information processing systems 32\.
- Z\. Ying, J\. You, C\. Morris, X\. Ren, W\. Hamilton, and J\. Leskovec \(2018\) Hierarchical graph representation learning with differentiable pooling\. Advances in neural information processing systems 31\.
- H\. Yuan, H\. Yu, S\. Gui, and S\. Ji \(2022\) Explainability in graph neural networks: a taxonomic survey\. IEEE transactions on pattern analysis and machine intelligence 45\(5\), pp\. 5782–5799\.
- Z\. Zhang, M\. Xu, A\. Jamasb, V\. Chenthamarakshan, A\. Lozano, P\. Das, and J\. Tang \(2022\) Protein representation learning by geometric structure pretraining\. arXiv preprint arXiv:2203\.06125\.

## Appendix AAlgorithmic Details

Algorithm 1 SoftBlobGIN forward pass

1. Input: graph $G = (V, E, \mathbf{X}, \mathbf{E}_{\mathrm{attr}})$, temperature $\tau$
2. $H^{(0)} \leftarrow \mathrm{Linear}(\mathbf{X})$
3. for $\ell = 1, \dots, L$ do
4. $H^{(\ell)} \leftarrow \mathrm{GINEConv}(H^{(\ell-1)}, E, \mathbf{E}_{\mathrm{attr}})$
5. $H^{(\ell)} \leftarrow \mathrm{Drop}(\mathrm{ReLU}(\mathrm{BN}(H^{(\ell)})))$
6. end for
7. $L \leftarrow \mathrm{BlobHead}(H^{(L)})$; $A \leftarrow \mathrm{GumbelSoftmax}(L, \tau)$
8. for $k = 1, \dots, K$ do
9. $b_{k} \leftarrow \sum_{i} A_{ik} h_{i}^{(L)} / (\sum_{i} A_{ik} + \epsilon)$; $b_{k} \leftarrow \mathrm{LN}(\mathrm{MLP}(b_{k}))$
10. end for
11. $z_{\mathrm{blob}} \leftarrow \max_{k} b_{k}$; $z_{\mathrm{global}} \leftarrow \tfrac{1}{N} \sum_{i} h_{i}^{(L)}$
12. return $\mathrm{Classifier}([z_{\mathrm{blob}} \,\|\, z_{\mathrm{global}}])$
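
Steps 7 to 11 of Algorithm 1 can be sketched in plain Python\. This is a simplified illustration under stated assumptions, not the authors' implementation: the GINEConv layers are replaced by given node embeddings `H`, the blob\-head logits are supplied directly, and the per\-blob refinement $\mathrm{LN}(\mathrm{MLP}(\cdot))$ is omitted\. All function and variable names are hypothetical\.

```python
import math
import random


def gumbel_softmax(logits, tau, rng):
    """Row-wise soft sample: softmax((logits + Gumbel noise) / tau)."""
    out = []
    for row in logits:
        noisy = [
            (l - math.log(-math.log(rng.uniform(1e-9, 1.0 - 1e-9)))) / tau
            for l in row
        ]
        m = max(noisy)  # subtract the max for numerical stability
        exps = [math.exp(v - m) for v in noisy]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def soft_blob_readout(H, logits, tau, rng, eps=1e-8):
    """Assignment-weighted mean per blob, max over blobs, concat with global mean."""
    n, dim, num_blobs = len(H), len(H[0]), len(logits[0])
    A = gumbel_softmax(logits, tau, rng)  # N x K soft assignments
    blobs = []
    for k in range(num_blobs):
        mass = sum(A[i][k] for i in range(n)) + eps
        blobs.append([sum(A[i][k] * H[i][d] for i in range(n)) / mass for d in range(dim)])
    # NOTE: the LN(MLP(b_k)) refinement from step 9 is skipped in this sketch.
    z_blob = [max(blobs[k][d] for k in range(num_blobs)) for d in range(dim)]
    z_global = [sum(H[i][d] for i in range(n)) / n for d in range(dim)]
    return A, z_blob + z_global  # list concatenation = [z_blob || z_global]


# Toy inputs (assumption): 6 residues, width 4, K = 3 blobs.
rng = random.Random(0)
H = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(6)]
logits = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(6)]
A, z = soft_blob_readout(H, logits, tau=0.5, rng=rng)
```

Lowering `tau` pushes each row of `A` toward a one\-hot assignment, which is the effect the temperature annealing is meant to achieve\.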

## Appendix BNode Features

Each residue $i$ carries a $d$\-dimensional feature vector $x_{i} \in \mathbb{R}^{d}$ formed by concatenating six blocks:

$$x_{i} = \big[\, \phi^{\mathrm{aa}}(s_{i}) \,\|\, \phi^{\mathrm{phys}}(s_{i}) \,\|\, \phi^{\mathrm{sasa}}(i) \,\|\, \phi^{\mathrm{esm}}(S)_{i} \,\|\, \phi^{\mathrm{deg}}(i) \,\|\, \phi^{\mathrm{pos}}(i, N) \,\big], \tag{10}$$

with dimensions $20 + 10 + 2 + 1280 + 1 + 5 = 1318$\. Specifically:

- $\phi^{\mathrm{aa}}(s) \in \{0, 1\}^{20}$ is the one\-hot encoding;
- $\phi^{\mathrm{phys}}(s) \in [0, 1]^{10}$ is the min–max normalised vector of \(Kyte–Doolittle hydrophobicity, charge, MW, vdW volume, Grantham polarity, Vihinen flexibility, accessibility, helix\-/sheet\-/turn\-propensity\);
- $\phi^{\mathrm{sasa}}(i) = (\widetilde{\mathrm{SASA}}_{i}, \mathrm{RSA}_{i}) \in [0, 1]^{2}$;
- $\phi^{\mathrm{esm}} : \Sigma^{*} \to \mathbb{R}^{N \times 1280}$ is the ESM\-2 \(650M, layer 33\) per\-residue extractor \[Lin et al\., [2023](https://arxiv.org/html/2605.10985#bib.bib8)\], treated as a frozen black box;
- $\phi^{\mathrm{deg}}(i) = \deg(i) / \max_{j} \deg(j) \in [0, 1]$;
- $\phi^{\mathrm{pos}}(i, N) = \big(\tfrac{i}{N}, \sin\tfrac{i\pi}{N}, \cos\tfrac{i\pi}{N}, \sin\tfrac{2i\pi}{N}, \cos\tfrac{2i\pi}{N}\big)$\.

We write $\mathbf{X} \in \mathbb{R}^{N \times d}$ for the stack of $x_{i}$\.
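
The two hand\-crafted blocks with closed\-form definitions, $\phi^{\mathrm{deg}}$ and $\phi^{\mathrm{pos}}$, together with the dimension bookkeeping of Eq\. 10, can be sketched as below\. The function names are hypothetical, and the degree normalisation assumes the graph has at least one edge\.

```python
import math


def positional_features(i, n):
    """phi_pos(i, N) = (i/N, sin(i*pi/N), cos(i*pi/N), sin(2*i*pi/N), cos(2*i*pi/N))."""
    x = i / n
    return [
        x,
        math.sin(math.pi * x),
        math.cos(math.pi * x),
        math.sin(2 * math.pi * x),
        math.cos(2 * math.pi * x),
    ]


def degree_features(degrees):
    """phi_deg(i) = deg(i) / max_j deg(j); assumes max degree > 0."""
    max_deg = max(degrees)
    return [d / max_deg for d in degrees]


# Block widths as listed in Appendix B.
BLOCK_DIMS = {"aa": 20, "phys": 10, "sasa": 2, "esm": 1280, "deg": 1, "pos": 5}
total_dim = sum(BLOCK_DIMS.values())  # node-feature dimensionality d

pos_mid = positional_features(5, 10)   # residue halfway along a 10-residue chain
deg_norm = degree_features([2, 4, 1])  # toy degrees (assumption)
```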

## Appendix CMetrics

We measure explanation quality via four scalar functionals on $\mathcal{D}_{\mathrm{test}}$\. Let $\mathcal{T}_{s}(M)$ denote the binary mask that retains only the top $\lceil s|E| \rceil$ edges of $M$, & let $\mathbf{1} - \mathcal{T}_{s}(M)$ be its complement\.

#### \(a\) Sparsity\.

$$\mathrm{Sp}(M) = |E|^{-1} \sum_{e} \mathbbm{1}[M_{e} < 0.5].$$

#### \(b\) Sufficiency \(Fidelity$^{+}$\)\.

Predictive agreement when keeping only the top edges:

$$\mathrm{Fid}^{+}(s) = \mathbb{E}_{(G,y)}\Big[\, \mathbbm{1}\big[\, \arg\max f_{\theta}(G_{\mathcal{T}_{s}(M)}) = \arg\max f_{\theta}(G) \,\big] \,\Big]. \tag{11}$$

#### \(c\) Necessity \(Fidelity$^{-}$\)\.

Predictive breakage when removing the top edges:

$$\mathrm{Fid}^{-}(s) = 1 - \mathbb{E}_{(G,y)}\Big[\, \mathbbm{1}\big[\, \arg\max f_{\theta}(G_{\mathbf{1} - \mathcal{T}_{s}(M)}) = \arg\max f_{\theta}(G) \,\big] \,\Big]. \tag{12}$$
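
Sparsity and the two fidelity scores can be illustrated on a toy model\. In this sketch a "graph" is just a list of edge weights and the stand\-in classifier thresholds their sum; both are assumptions made so the example runs without a GNN, and all names are hypothetical\. The second toy graph is deliberately redundant, which is the situation that keeps $\mathrm{Fid}^{-}$ low\.

```python
import math


def top_edges(mask, s):
    """Indices of the top ceil(s * |E|) edges by mask value, i.e. T_s(M)."""
    k = math.ceil(s * len(mask))
    order = sorted(range(len(mask)), key=lambda e: mask[e], reverse=True)
    return set(order[:k])


def sparsity(mask):
    """Sp(M): fraction of edges whose mask value is below 0.5."""
    return sum(m < 0.5 for m in mask) / len(mask)


def fidelity(model, graphs, masks, s):
    """Return (Fid+, Fid-) following Eqs. 11 and 12."""
    keep_agree = drop_agree = 0
    for edges, mask in zip(graphs, masks):
        full_pred = model(edges)
        top = top_edges(mask, s)
        kept = [e for i, e in enumerate(edges) if i in top]
        dropped = [e for i, e in enumerate(edges) if i not in top]
        keep_agree += model(kept) == full_pred
        drop_agree += model(dropped) == full_pred
    n = len(graphs)
    return keep_agree / n, 1 - drop_agree / n


def toy_model(edge_weights):
    """Stand-in classifier (assumption): class 1 if total edge weight exceeds 1."""
    return int(sum(edge_weights) > 1.0)


graphs = [[0.9, 0.8, 0.1, 0.1], [0.6, 0.6, 0.6, 0.6]]
masks = [[0.95, 0.9, 0.2, 0.1], [0.9, 0.8, 0.7, 0.6]]
fid_plus, fid_minus = fidelity(toy_model, graphs, masks, s=0.5)
```

In the second graph either half of the edges is enough to reach class 1, so removing the top edges does not change the prediction and contributes nothing to $\mathrm{Fid}^{-}$\.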

#### \(d\) Class stability\.

Mean intra\-class cosine similarity of feature masks:

$$\mathrm{Stab}(c) = \frac{2}{|\mathcal{S}_{c}|(|\mathcal{S}_{c}| - 1)} \sum_{\substack{(F^{(a)}, F^{(b)}) \in \mathcal{S}_{c} \\ a < b}} \frac{\langle F^{(a)}, F^{(b)} \rangle}{\|F^{(a)}\| \, \|F^{(b)}\|}, \tag{13}$$

where $\mathcal{S}_{c} = \{F^{(n)} : y^{(n)} = c\}$\.
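
Eq\. 13 is the mean cosine similarity over all unordered pairs of feature masks within one class\. A minimal sketch, with hypothetical function names and toy two\-dimensional masks:

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def class_stability(feature_masks):
    """Stab(c): mean pairwise cosine similarity over masks of one class.

    Dividing the pair sum by the number of pairs |S|(|S|-1)/2 matches the
    2 / (|S|(|S|-1)) prefactor in Eq. 13.
    """
    pairs = list(combinations(feature_masks, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


# Toy feature masks for one class (assumption): two identical, one orthogonal.
masks_c = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
stab = class_stability(masks_c)
```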

## Appendix DBaseline Methods

We evaluate four models of increasing structural expressivity \(Table [7](https://arxiv.org/html/2605.10985#A4.T7)\)\. All four share the feature pipeline of Eq\. [10](https://arxiv.org/html/2605.10985#A2.E10) so that differences in performance reflect differences in topological inductive bias\.

Table 7: Model hierarchy by structural expressivity\.

#### Seq MLP & Residue MLP\.

The first two baselines deliberately *discard* graph structure to isolate the contribution of the contact graph\. Seq MLP uses only the bag\-of\-amino\-acids vector $\bar{x}^{\mathrm{aa}} = \tfrac{1}{N} \sum_{i} \phi^{\mathrm{aa}}(s_{i})$ as input, while Residue MLP uses the full mean\-pooled feature $\bar{x} = \tfrac{1}{N} \sum_{i} x_{i} \in \mathbb{R}^{1318}$\. Both end in a 3\-layer MLP with LayerNorm, GELU & dropout 0\.3\.

#### GIN\.

We instantiate $g_{\theta}$ with $L = 4$ GINEConv \[Hu et al\., [2019](https://arxiv.org/html/2605.10985#bib.bib1)\] layers of width $\hbar = 256$, BatchNorm & ReLU between layers, JumpingKnowledge concatenation \[Xu et al\., [2018b](https://arxiv.org/html/2605.10985#bib.bib16)\] \($\hbar L = 1024$ per node\), & a dual mean\-max readout to a 2048\-dim graph embedding\. A two\-layer MLP head with BatchNorm produces the final logits\. Total: $\sim$1\.4M parameters\.

#### SoftBlobGIN\.

The final architecture replaces JumpingKnowledge & the dual readout with the differentiable soft\-blob pooling defined in Eq\. \([4](https://arxiv.org/html/2605.10985#S4.E4)\) through \([6](https://arxiv.org/html/2605.10985#S4.E6)\)\. We use $K = 8$ blobs & anneal $\tau$ linearly from $1.0$ to $0.1$ across training epochs\. The blob refinement $g_{\psi}$ is a 2\-layer MLP with LayerNorm\. This substitutes BioBlobs’ GVP encoder \+ VQ codebook with a single MLP head, at $\sim$1\.1M total parameters\. Algorithm [1](https://arxiv.org/html/2605.10985#alg1) gives the full forward pass\.
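
The linear temperature schedule can be written in a few lines\. The sketch assumes that $\tau$ is updated once per epoch and reaches $0.1$ on the final epoch; the number of epochs and the function name are hypothetical, since neither is stated here\.

```python
def linear_tau(epoch, num_epochs, tau_start=1.0, tau_end=0.1):
    """Linearly interpolate the Gumbel-softmax temperature over training.

    `epoch` is 0-based; the last epoch (num_epochs - 1) returns tau_end.
    """
    if num_epochs <= 1:
        return tau_end
    frac = epoch / (num_epochs - 1)
    return tau_start + frac * (tau_end - tau_start)


# Example with an assumed 10-epoch run.
schedule = [linear_tau(e, 10) for e in range(10)]
```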

## Appendix EHyperparameter Sweeps

We searched over contact radius $\varepsilon \in \{4, 8, 12\}$ Å & blob count $K \in \{3, 5, 8, 12\}$ on the EC validation split\. Performance peaked at $\varepsilon = 8$ Å \(the standard choice for residue contact graphs\) & $K = 8$ blobs, consistent with the intuition that enzymes have a small number of distinct functional substructures \(active site, cofactor pocket, substrate channel, scaffold\)\. At $K = 3$ the model under\-segments & core blobs absorb functional residues; at $K = 12$ many blobs are empty after Gumbel annealing\. Performance is broadly stable in the $K \in [5, 8]$ range\.
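
How the contact radius changes the graph can be seen on a toy chain\. The sketch assumes that an edge connects two residues whose C$\alpha$ atoms lie within $\varepsilon$ of each other; the precise edge definition is given in the methods section and may differ\. The coordinates and the function name are hypothetical\.

```python
import math


def contact_edges(ca_coords, radius=8.0):
    """Undirected edges (i, j), i < j, between residues within `radius` Å (assumed C-alpha distance)."""
    edges = []
    for i in range(len(ca_coords)):
        for j in range(i + 1, len(ca_coords)):
            if math.dist(ca_coords[i], ca_coords[j]) <= radius:
                edges.append((i, j))
    return edges


# Toy chain (assumption): four residues on a line, 3.8 Å apart.
ca = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]

edge_counts = {eps: len(contact_edges(ca, eps)) for eps in (4.0, 8.0, 12.0)}
```

At 4 Å only sequence neighbours are connected, while at 12 Å the toy graph is complete; the swept values trade locality against density in the same way on real structures\.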

| Seed | Accuracy | Macro F1 | AUROC |
| --- | --- | --- | --- |
| 42 | 0\.914 | 0\.890 | 0\.943 |
| 123 | 0\.909 | 0\.871 | 0\.939 |
| 456 | 0\.890 | 0\.824 | 0\.948 |
| 789 | 0\.912 | 0\.888 | 0\.949 |
| 1024 | 0\.898 | 0\.815 | 0\.931 |
| Mean individual | 0\.905 | 0\.858 | 0\.942 |
| Ensemble | 0\.928 | 0\.898 | 0\.955 |
| Ensemble gain | \+0\.024 | \+0\.040 | \+0\.013 |

Table 8: 5\-seed SoftBlobGIN ensemble on ProteinShake EC\.
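
The ensemble averages softmax outputs across seeds\. The sketch below shows that rule on toy logits \(an assumption, not model outputs\) and also recomputes the mean individual macro\-F1 from the per\-seed values in Table 8; the function names are hypothetical\.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one logit vector."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_predict(per_seed_logits):
    """Average class probabilities across seeds, then take the argmax."""
    probs = [softmax(l) for l in per_seed_logits]
    n_classes = len(probs[0])
    mean = [sum(p[c] for p in probs) / len(probs) for c in range(n_classes)]
    return max(range(n_classes), key=lambda c: mean[c]), mean


# Toy logits from three seeds (assumption): two seeds narrowly prefer class 0,
# one seed confidently prefers class 1.
logits = [[2.0, 1.9, 0.0], [0.0, 3.0, 0.0], [2.0, 1.9, 0.0]]
label, mean_probs = ensemble_predict(logits)

# Per-seed macro-F1 values copied from Table 8.
seed_macro_f1 = [0.890, 0.871, 0.824, 0.888, 0.815]
mean_macro_f1 = round(sum(seed_macro_f1) / len(seed_macro_f1), 3)
```

In the toy case the averaged probabilities select class 1 even though a majority vote over seeds would select class 0, which illustrates how probability averaging lets a confident seed outweigh two uncertain ones\.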
## Appendix FExplainability Analysis

We report quantitative explanation metrics for SoftBlobGIN & Integrated Gradients \(IG\), evaluating sparsity, stability, & faithfulness of learned explanations on the ProteinShake EC benchmark\. Vanilla gradient saliency was excluded because gradient saturation produces noisy, poorly localized scores\.

#### Aggregate statistics\.

Table [9](https://arxiv.org/html/2605.10985#A6.T9) summarizes explanation quality for SoftBlobGIN\. Explanations are highly sparse \(mean $= 0.814$\), retaining only a small fraction of residues with high attribution\. Masked confidence of $0.736$ indicates that these residues alone preserve most predictive signal\. Stability is consistently high across all seven EC classes \($0.863$–$0.911$\), confirming reproducible explanations within each functional family\.

| Metric | Value |
| --- | --- |
| Mean sparsity | 0\.814 |
| Mean masked confidence $\uparrow$ | 0\.736 |
| Mean unfaithfulness $\downarrow$ | 0\.437 |
| Characterization score | 0\.116 |
| EC stability range $\uparrow$ | 0\.863–0\.911 |

Table 9: Aggregate explanation statistics for SoftBlobGIN \($\uparrow$ higher is better; $\downarrow$ lower is better\)\.

#### Faithfulness under progressive masking\.

Table [10](https://arxiv.org/html/2605.10985#A6.T10) compares IG & SoftBlobGIN via Fidelity$^{+}$ \(prediction preserved after retaining top residues\) & Fidelity$^{-}$ \(prediction degraded after removing them\)\. Both methods achieve Fidelity$^{+} = 1.000$ at $\geq 20\%$ sparsity, confirming that top\-ranked residues suffice to recover the original prediction\. Low Fidelity$^{-}$ across all levels reflects mild redundancy in residue importance, consistent across both methods\.

| Method | Sparsity | Fidelity$^{+}$ $\uparrow$ | Fidelity$^{-}$ $\uparrow$ |
| --- | --- | --- | --- |
| Integrated Gradients | 0\.05 | 0\.978 | 0\.000 |
| Integrated Gradients | 0\.10 | 0\.989 | 0\.011 |
| Integrated Gradients | 0\.20 | 1\.000 | 0\.011 |
| Integrated Gradients | 0\.30 | 1\.000 | 0\.022 |
| Integrated Gradients | 0\.50 | 1\.000 | 0\.022 |
| Integrated Gradients | 0\.70 | 1\.000 | 0\.022 |
| SoftBlobGIN | 0\.05 | 0\.989 | 0\.000 |
| SoftBlobGIN | 0\.10 | 0\.978 | 0\.000 |
| SoftBlobGIN | 0\.20 | 1\.000 | 0\.011 |
| SoftBlobGIN | 0\.30 | 1\.000 | 0\.011 |
| SoftBlobGIN | 0\.50 | 1\.000 | 0\.022 |
| SoftBlobGIN | 0\.70 | 1\.000 | 0\.022 |

Table 10: Faithfulness comparison under progressive sparsification\. Fidelity$^{+}$ $\uparrow$ measures retention of top residues; Fidelity$^{-}$ $\uparrow$ measures degradation upon their removal\.

## Appendix GBlob structural statistics & functional analysis

SoftBlobGIN provides explanations by construction through learned blob assignments, without requiring post\-hoc optimization or gradient\-based attribution\. To evaluate whether these assignments correspond to biologically meaningful structural units, we analyze solvent accessibility, active\-site enrichment, amino acid composition, spatial coherence, & agreement with post\-hoc explainers\.

#### Structural decomposition of learned blobs\.

Across all test proteins, SoftBlobGIN consistently partitions each protein into one large solvent\-exposed structural blob & several smaller buried blobs\. The dominant structural blob spans $102$–$335$ residues with mean normalized solvent accessibility \(SASA\) $0.28$–$0.36$, whereas smaller blobs contain $8$–$40$ residues with mean SASA $0.10$–$0.17$\.

Functional blobs are therefore approximately $2$–$2.6\times$ more buried than the structural core blob, despite the model receiving no active\-site supervision during training\.

Table 11: Aggregate structural statistics of learned blob assignments\.

Figure [3](https://arxiv.org/html/2605.10985#A7.F3) shows solvent accessibility profiles across EC classes\. All classes exhibit a consistent pattern with one dominant surface\-exposed blob & multiple buried blobs\.

![Refer to caption](https://arxiv.org/html/2605.10985v1/fig/fig_blob_sasa_profiles.png)

Figure 3: Mean normalized solvent accessibility \(SASA\) of learned blobs across EC classes\. Lower values indicate more buried, active\-site\-like regions\.

#### Active\-site blob enrichment\.

To test whether learned blobs capture functionally relevant substructures, we analyze proteins with active\-site annotations obtained through the PDB\-to\-UniProt features pipeline\.

Blob importance is defined as the fraction of readout dimensions for which blob $t$ provides the argmax in the pooled representation $z_{\mathrm{blob}}=\max_{k} b_{k}$.
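The definition above can be made concrete with a short sketch. It assumes the blob embeddings $b_k$ are available as $K$ vectors of equal length $D$ and that the max is taken element-wise; the function name `blob_importance`, the tie-breaking rule (lowest blob index wins), and the toy embeddings are assumptions for illustration.

```python
def blob_importance(blob_embeddings):
    """Fraction of readout dimensions in which each blob supplies the max.

    blob_embeddings: list of K vectors, each of length D (the b_k).
    Returns a list of K fractions that sum to 1.
    """
    num_blobs = len(blob_embeddings)
    if num_blobs == 0:
        raise ValueError("need at least one blob")
    dim = len(blob_embeddings[0])
    if dim == 0 or any(len(vec) != dim for vec in blob_embeddings):
        raise ValueError("all blob embeddings must share a non-zero length")

    wins = [0] * num_blobs
    for d in range(dim):
        column = [blob_embeddings[k][d] for k in range(num_blobs)]
        # argmax over blobs for this dimension; ties go to the lowest index
        winner = max(range(num_blobs), key=column.__getitem__)
        wins[winner] += 1
    return [w / dim for w in wins]


# Toy example: 3 blobs, 4 readout dimensions (made-up values).
embeddings = [
    [0.9, 0.1, 0.5, 0.2],
    [0.2, 0.8, 0.7, 0.1],
    [0.1, 0.3, 0.6, 0.4],
]
print(blob_importance(embeddings))  # [0.25, 0.5, 0.25]
```

Here blob 1 supplies the maximum in two of four dimensions, so it receives importance 0.5.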

Across proteins with active-site annotations, blobs containing annotated active-site residues carry $1.85\times$ higher importance than other blobs on average, rank in the top 3 of 8, & show significant correlation between importance rank & active-site enrichment ($\rho=0.339$, $p=0.009$).

![Refer to caption](https://arxiv.org/html/2605.10985v1/fig/fig_importance_vs_active_site.png)

Figure 4: Blob importance analysis for proteins with active-site annotations. Left: rank distribution of active-site blobs. Middle: comparison of active-site blob importance versus other blobs. Right: mean importance advantage by EC class.

#### Amino acid enrichment of learned blobs.

To assess whether blobs capture chemically meaningful residue composition, we compute amino acid enrichment relative to background frequency for each blob. Figure [5](https://arxiv.org/html/2605.10985#A7.F5) shows that several compact blobs exhibit distinctive residue preferences, including enrichment of residue types commonly associated with catalytic activity.
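One way to compute such an enrichment is sketched below. The text does not specify the exact statistic, so this example assumes a plain frequency ratio (frequency within the blob divided by background frequency) and takes the background over all supplied residues; the function name `aa_enrichment` and the toy sequence are hypothetical.

```python
from collections import Counter, defaultdict


def aa_enrichment(residues, blob_ids):
    """Return {blob: {aa: blob_frequency / background_frequency}}.

    residues: one-letter amino acid codes, one per residue.
    blob_ids: hard blob assignment per residue.
    Background frequency is computed over all supplied residues (assumption).
    """
    if len(residues) != len(blob_ids):
        raise ValueError("residues and blob_ids must have the same length")
    background = Counter(residues)
    total = len(residues)

    per_blob = defaultdict(Counter)
    for aa, blob in zip(residues, blob_ids):
        per_blob[blob][aa] += 1

    enrichment = {}
    for blob, counts in per_blob.items():
        size = sum(counts.values())
        enrichment[blob] = {
            aa: (counts[aa] / size) / (background[aa] / total)
            for aa in background
        }
    return enrichment


# Toy example (made-up sequence): blob 1 is rich in His/Asp/Ser.
residues = "HHDSAAAAGG"
blob_ids = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
for blob, ratios in sorted(aa_enrichment(residues, blob_ids).items()):
    print(blob, {aa: round(r, 2) for aa, r in sorted(ratios.items())})
```

A ratio above 1 means the residue type is over-represented in that blob relative to the background; a log ratio with pseudocounts would be a reasonable alternative when blobs are small.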

![Refer to caption](https://arxiv.org/html/2605.10985v1/fig/fig_blob_aa_enrichment.png)

Figure 5: Per-blob amino acid enrichment relative to background amino acid frequencies. Stars indicate residues commonly associated with catalytic activity.

#### Spatial coherence of learned blobs.

A useful structural decomposition should assign spatially nearby residues to the same blob. Figure [6](https://arxiv.org/html/2605.10985#A7.F6) shows mean intra-blob C$\alpha$ distances across EC classes.

Learned blobs are spatially compact across all classes, indicating that assignments are not arbitrary graph partitions\.
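The compactness measure can be sketched as the mean pairwise C$\alpha$ distance within each blob. This is an assumed reading of "mean intra-blob distance"; the function name `mean_intra_blob_distance`, the choice to skip single-residue blobs, and the toy coordinates are illustrative only.

```python
from itertools import combinations
from math import dist
from statistics import mean


def mean_intra_blob_distance(coords, blob_ids):
    """Return {blob: mean pairwise distance between its C-alpha atoms}.

    coords: one (x, y, z) tuple per residue, in Angstroms.
    blob_ids: hard blob assignment per residue.
    Blobs with fewer than two residues are skipped (no pairs to average).
    """
    if len(coords) != len(blob_ids):
        raise ValueError("coords and blob_ids must have the same length")
    groups = {}
    for xyz, blob in zip(coords, blob_ids):
        groups.setdefault(blob, []).append(xyz)
    return {
        blob: mean(dist(a, b) for a, b in combinations(points, 2))
        for blob, points in groups.items()
        if len(points) >= 2
    }


# Toy example with made-up coordinates.
coords = [(0, 0, 0), (3, 4, 0), (0, 0, 10), (0, 0, 12), (0, 0, 14)]
blob_ids = [0, 0, 1, 1, 1]
print(mean_intra_blob_distance(coords, blob_ids))
```

A natural baseline, not shown here, is the same statistic under randomly shuffled `blob_ids`, which indicates how compact an arbitrary partition of the same sizes would be.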

![Refer to caption](https://arxiv.org/html/2605.10985v1/fig/fig_blob_spatial_coherence.png)

Figure 6: Mean intra-blob C$\alpha$ distance across EC classes. Lower distance indicates greater spatial compactness.

#### Interpretation of agreement with post-hoc explainers.

We compare SoftBlobGIN blob assignments with GNNExplainer edge saliency. Figure [7](https://arxiv.org/html/2605.10985#A7.F7) shows modest but non-zero agreement between top-ranked blobs & GNNExplainer important residues, indicating partial overlap between intrinsic & post-hoc explanations.

Together, solvent accessibility, active-site enrichment, amino acid composition, spatial coherence, & agreement with post-hoc explainers indicate that SoftBlobGIN learns biologically meaningful structural partitions rather than arbitrary graph clusters.

![Refer to caption](https://arxiv.org/html/2605.10985v1/fig/fig_gin_blob_overlap.png)

Figure 7: Agreement between SoftBlobGIN blob assignments & GNNExplainer important residues. Left: per-EC Jaccard overlap. Right: distribution of overlap scores across proteins.

![Refer to caption](https://arxiv.org/html/2605.10985v1/fig/fig_edge_importance_examples.png)

Figure 8: Representative GNNExplainer edge-importance maps across EC classes. Darker edges indicate higher importance.

## Appendix H Relationship to known domain boundaries

To evaluate whether learned blobs merely reproduce known domain segmentations, we compare blob assignments against annotated Pfam/CATH domains. Blob-domain agreement is partial, with mean Jaccard overlap $0.380$. This indicates that learned blobs are related to known domain boundaries but do not simply recover canonical domain annotations.
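A Jaccard-based comparison of this kind can be sketched as follows. The text does not state how blobs are matched to domains, so the example assumes each domain is paired with its best-overlapping blob before averaging; the function names `jaccard` and `mean_best_jaccard` and the toy residue ranges are hypothetical.

```python
def jaccard(a, b):
    """Jaccard overlap |A intersect B| / |A union B| between two residue-index sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def mean_best_jaccard(blobs, domains):
    """Average, over domains, of the best Jaccard overlap with any blob.

    blobs, domains: lists of iterables of residue indices.
    Best-match pairing is an assumption made for this sketch.
    """
    if not blobs or not domains:
        raise ValueError("need at least one blob and one domain")
    best = [max(jaccard(blob, domain) for blob in blobs) for domain in domains]
    return sum(best) / len(best)


# Toy example: two blobs that roughly, but not exactly, follow two domains.
blobs = [range(0, 60), range(60, 100)]
domains = [range(0, 50), range(50, 100)]
print(mean_best_jaccard(blobs, domains))
```

The same `jaccard` helper would apply to other residue-set comparisons, such as top-ranked blob residues versus residues flagged as important by a post-hoc explainer.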

This behavior is expected: domains represent evolutionary & structural units, whereas SoftBlobGIN blobs are optimized directly for enzyme classification. Learned blobs therefore partially align with known domains while deviating when functional discrimination benefits from finer-grained or cross-domain partitions. Finally, representative GNNExplainer edge-importance maps are shown in Figure [8](https://arxiv.org/html/2605.10985#A7.F8).

## Appendix I Societal Impact

Our framework is designed to make protein function predictions more transparent by providing structurally grounded explanations alongside classifications\. This has positive implications for drug discovery, enzyme engineering, & clinical genomics, where interpretability supports regulatory compliance & scientific trust\. We do not foresee direct negative societal impacts: the method operates on publicly available protein structures & does not generate novel sequences or designs that could be misused\. All data used in this work \(ProteinShake, ESM\-2 embeddings, PDB/UniProt annotations\) are publicly released under permissive licenses, & our code & model checkpoints carry no dual\-use risk beyond standard protein classification tools\.