# ProtSent: Protein Sentence Transformers
Source: [https://arxiv.org/html/2605.06830](https://arxiv.org/html/2605.06830)
Dan Ofer, Department of Biological Chemistry, The Hebrew University of Jerusalem
Oriel Perets, Department of Computer and Information Science, Ben-Gurion University of the Negev
Michal Linial, Department of Biological Chemistry, The Hebrew University of Jerusalem
Nadav Rappoport, Department of Computer and Information Science, Ben-Gurion University of the Negev
###### Abstract
Protein language models (pLMs) produce per-residue representations that capture evolutionary and structural information, yet their mean-pooled sequence embeddings are not explicitly trained to reflect functional, evolutionary, or structural similarity between proteins. We present Protein Sentence Transformers (ProtSent), a contrastive fine-tuning framework for adapting pLMs into general-purpose embedding models. ProtSent trains with MultipleNegativesRankingLoss across five protein-pair datasets: Pfam families, structurally derived hard negatives, AlphaFold DB structural pairs, StringDB protein–protein interactions, and Deep Mutational Scanning data. We evaluate on 23 downstream tasks using frozen embeddings with a k-nearest-neighbor probe to measure embedding neighborhood quality. On ESM-2 150M, ProtSent improves 15 of 23 tasks, with gains of +105% on remote homology detection, +17% on variant effect prediction, and +19.9% Recall@1 on SCOPe-40 structural retrieval. The 35M variant improves 16 of 23 tasks, with +40.5% on remote homology and +15.5% Recall@1 on SCOPe-40. Contrastive fine-tuning restructures the embedding space to better capture protein function and structure, without any task-specific supervision. We release the models, public data, training recipe, and code.
## 1 Introduction
In natural language processing, an analogous limitation of BERT-style models was addressed by Sentence-BERT (Reimers and Gurevych, [2019](https://arxiv.org/html/2605.06830#bib.bib3)), which applies contrastive learning to restructure the embedding space so that semantically similar sentences become neighbors. The contrastive objective directly optimizes the metric that downstream tasks rely on: embedding proximity.
We apply this principle to protein language models. Our framework, Protein Sentence Transformers (ProtSent), fine-tunes pLM backbones end-to-end using MultipleNegativesRankingLoss (MNRL) (Henderson et al., [2017](https://arxiv.org/html/2605.06830#bib.bib4)) across multiple datasets, each capturing a different axis of biological relatedness. We combine five data sources: (i) Pfam family membership, (ii) structurally derived hard negatives, (iii) AlphaFold DB structural pairs, (iv) StringDB interaction pairs, and (v) deep mutational scanning (DMS) data, the last trained with a CoSENT loss (Su, [2022](https://arxiv.org/html/2605.06830#bib.bib5)) to capture continuous fitness landscapes.
We evaluate ProtSent on a suite of 23 downstream tasks using a deliberately simple protocol: embeddings are frozen and evaluated with a k-nearest-neighbor (KNN) probe. This evaluation strategy measures the quality of the embedding space geometry rather than the capacity of a learned classifier on top of it. The rationale is that if contrastive training successfully restructures the neighborhood structure, a probe that relies entirely on neighbor identity should be the most sensitive detector of improvement.
Our contributions are as follows:
- •We introduce ProtSent, a contrastive fine\-tuning framework for protein language models that combines five protein\-pair datasets with round\-robin sampling\.
- •We demonstrate that contrastive fine-tuning produces substantial gains on tasks that depend on embedding neighborhood quality, including +105% on remote homology detection and +19.9% Recall@1 on SCOPe-40 structural retrieval (ESM-2 150M), with similar gains at the 35M scale (+40.5% and +15.5%, respectively).
- •We show that these improvements are consistent across two model scales \(35M and 150M parameters\) and multiple biological tasks spanning function, structure, engineering and mutation\.
- •We provide ablation studies isolating the contribution of each training data source and the sampling strategy to the final performance\.
## 2 Related work
#### Protein language models\.
Self-supervised protein language models learn residue-level representations from large sequence databases. ESM-1b and ESM-2 (Rives et al., [2021](https://arxiv.org/html/2605.06830#bib.bib1); Lin et al., [2023](https://arxiv.org/html/2605.06830#bib.bib2)) train Transformer encoders with masked language modeling on millions of UniRef sequences, producing embeddings that encode evolutionary conservation, secondary structure, and contact information. ESMFold (Lin et al., [2023](https://arxiv.org/html/2605.06830#bib.bib2)) extended this approach to atomic-resolution structure prediction. ProtTrans (Elnaggar et al., [2022](https://arxiv.org/html/2605.06830#bib.bib21)) explored several architectures (BERT, Albert, T5) at billion-parameter scale and showed that per-residue representations transfer well to secondary-structure and localization tasks. TAPE (Rao et al., [2019](https://arxiv.org/html/2605.06830#bib.bib12)) introduced a standardized benchmark suite and demonstrated that pretrained representations improve over hand-crafted features across five protein tasks. A common theme is that downstream transfer is typically mediated by a learned probe: a linear layer or small MLP trained on task-specific labels. In contrast, ProtSent targets the sequence-level embedding space, asking whether contrastive fine-tuning can restructure it so that nearest-neighbor lookup alone is sufficient for multiple downstream tasks.
#### Contrastive fine\-tuning of language models\.
In NLP, Sentence-BERT (Reimers and Gurevych, [2019](https://arxiv.org/html/2605.06830#bib.bib3)) showed that fine-tuning BERT with a siamese objective on natural-language inference pairs produces sentence embeddings whose cosine similarity correlates with semantic similarity, reducing the computational cost of sentence comparison from quadratic cross-encoder inference to a single embedding lookup. Subsequent work extended this to multilingual and domain-specific settings (Reimers and Gurevych, [2020](https://arxiv.org/html/2605.06830#bib.bib22)). For proteins, Heinzinger et al. ([2022](https://arxiv.org/html/2605.06830#bib.bib15)) proposed ProtTucker, which fine-tunes ProtT5 with a triplet loss on CATH superfamily labels (S30 subset, 3,186 training proteins) and demonstrated improved remote-homology detection. Independently, Redl et al. ([2023](https://arxiv.org/html/2605.06830#bib.bib16)) adapted the SentenceTransformers framework to ProtBERT and showed that contrastive training on disorder and stability annotations improves performance on those specific tasks. Our work differs from both in scale and breadth: ProtSent trains on over 70 million protein pairs drawn from five heterogeneous data sources (Pfam families, structural pairs, interaction networks, hard negatives, and DMS), and evaluates on 23 tasks spanning classification, regression, and retrieval, rather than a single axis of similarity. We aim to present a foundation model for universal use.
#### Embedding\-based protein search and retrieval\.
Traditional sequence search tools such as BLAST and HMMer rely on local alignment heuristics and profile Hidden Markov Models (Söding, [2005](https://arxiv.org/html/2605.06830#bib.bib31); Steinegger and Söding, [2017](https://arxiv.org/html/2605.06830#bib.bib32)). Recent work has explored learned embeddings as an alternative. PLMSearch (Liu et al., [2024](https://arxiv.org/html/2605.06830#bib.bib17)) combines ESM-1b embeddings with a lightweight cross-attention module to re-rank homology search results, achieving higher sensitivity than HMMer on remote homologs. Hong et al. ([2024](https://arxiv.org/html/2605.06830#bib.bib18)) introduced a dense homolog retrieval (DHR) system that trains a dual-encoder with contrastive learning on SCOPe domains and reports 93% sensitivity at a 1% false-positive rate. Foldseek (van Kempen et al., [2024](https://arxiv.org/html/2605.06830#bib.bib20)) takes a different approach, encoding 3D backbone geometry as a structural alphabet and performing fast structural search without full alignment. These systems are optimized end-to-end for retrieval; ProtSent instead targets general-purpose embeddings that transfer across diverse tasks, with retrieval (SCOPe-40) serving as one evaluation axis among many.
#### Protein function prediction\.
Predicting enzyme class, fitness effects, or other functional properties from sequence remains a core challenge. GOBeacon (Lin et al., [2025](https://arxiv.org/html/2605.06830#bib.bib19)) recently proposed an ensemble framework that applies contrastive regularization to multi-label GO classifiers, reporting improvements on CAFA-style benchmarks (Jiang et al., [2016](https://arxiv.org/html/2605.06830#bib.bib28)). Unsupervised and anomaly-based approaches do not necessarily beat supervised methods (Michael-Pitschaze et al., [2024](https://arxiv.org/html/2605.06830#bib.bib30); Ofer and Linial, [2025](https://arxiv.org/html/2605.06830#bib.bib29)). A related subproblem is variant-effect prediction, where supervised approaches train on deep mutational scanning (DMS) assays and zero-shot methods leverage pLM log-likelihoods (Rives et al., [2021](https://arxiv.org/html/2605.06830#bib.bib1)). ProtSent takes a middle path: it uses DMS data during training as an auxiliary regression signal (CoSENT loss) but does not train a task-specific predictor, instead relying on the restructured embedding space to capture continuous fitness landscapes.
## 3 Method
### 3.1 Problem formulation
Let $f_\theta: \mathcal{S} \rightarrow \mathbb{R}^d$ denote a protein language model parameterized by $\theta$ that maps an amino acid sequence $s \in \mathcal{S}$ to a $d$-dimensional embedding. Given a set of protein pairs $(s_i, s_j)$ drawn from biological relationships (shared family, structural similarity, physical interaction), our goal is to fine-tune $\theta$ so that the cosine similarity $\text{sim}(f_\theta(s_i), f_\theta(s_j))$ is high for biologically related pairs and low for unrelated pairs. After fine-tuning, the embedding space should support downstream tasks through simple nearest-neighbor lookup, without task-specific supervision.
### 3.2 Backbone architecture
We use ESM-2 (Lin et al., [2023](https://arxiv.org/html/2605.06830#bib.bib2)) as the backbone encoder, wrapped in the SentenceTransformers framework (Reimers and Gurevych, [2019](https://arxiv.org/html/2605.06830#bib.bib3)). ESM-2 is a protein language model pretrained with masked language modeling on UniRef50 sequences (Suzek et al., [2015](https://arxiv.org/html/2605.06830#bib.bib24)). We experiment with two model scales: ESM-2 35M (12 layers, 480-dimensional embeddings) and ESM-2 150M (30 layers, 640-dimensional embeddings). During contrastive fine-tuning, all Transformer layers are fine-tuned. Sequence-level embeddings are obtained by mean-pooling over non-padding residue tokens, producing a single vector $\mathbf{h} \in \mathbb{R}^d$ per protein. Input sequences are truncated to the first 512 residues; we do not apply center cropping.
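A minimal sketch of how such a backbone can be wrapped for sequence-level embeddings with the SentenceTransformers library (the checkpoint name and 512-residue limit follow the text; the authors' exact wrapper and released checkpoints may differ):

```python
from sentence_transformers import SentenceTransformer, models

# ESM-2 35M checkpoint on the HuggingFace hub; inputs beyond 512 residues are truncated.
backbone = models.Transformer("facebook/esm2_t12_35M_UR50D", max_seq_length=512)

# Mean-pool over non-padding residue tokens to get one vector per protein.
pooling = models.Pooling(backbone.get_word_embedding_dimension(), pooling_mode="mean")

model = SentenceTransformer(modules=[backbone, pooling])
emb = model.encode(["MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"])  # shape (1, 480) for the 35M model
```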
### 3.3 Training data
ProtSent trains on five protein-pair datasets summarized in Table [1](https://arxiv.org/html/2605.06830#S3.T1). Further details and filtering parameters are in Appendix [12](https://arxiv.org/html/2605.06830#S12):
Table 1: Training datasets after all filtering, clustering, and decontamination (raw upstream sizes are larger; see Appendix [12](https://arxiv.org/html/2605.06830#S12)). "Group label" is the positive-pair criterion for MNRL sources; STRING is pair-native and DMS uses continuous CoSENT scores.
#### Pfam family pairs.
Pfam-A full-alignment domains. Positives are pairs belonging to the same family (Mistry et al., [2021](https://arxiv.org/html/2605.06830#bib.bib6)). We deduplicate with MMseqs2 easy-linclust at 70% identity, leaving 32.9M domains in 26,796 families and 15,284 clans. In-batch negatives are other families within each batch. Singleton families are dropped. This dataset encodes evolutionary and functional homology at the domain level.
#### Pfam hard negatives\.
We generate novel hard evolutionary negatives. For each anchor per family, using the Pfam-A Hidden Markov Model match-state emissions (PSSM), we sample mutants with $\geq 3$ point substitutions at distinct positions spaced at least $\max(6, L/8)$ apart, drawn from positions with per-residue $\Delta S < -1.0$, until the total log-odds drop satisfies $\sum_i \Delta S_i \leq -16.0$. This cutoff is a heuristic chosen so that a mutant is likely to break the conserved profile and be deleterious, while retaining >98% sequence similarity to its anchor. These hard negatives force the model to learn fine-grained discriminative features beyond simple sequence identity.
#### AlphaFold DB structural pairs\.
AFDB50 sequences (pLDDT > 70, non-fragment) from AlphaFold DB (Varadi et al., [2022](https://arxiv.org/html/2605.06830#bib.bib7); Barrio-Hernandez et al., [2023](https://arxiv.org/html/2605.06830#bib.bib27)) predicted structures are joined, on their AFDB50 representative, with the Steinegger-lab AFDB Foldseek (van Kempen et al., [2024](https://arxiv.org/html/2605.06830#bib.bib20)) structural clusters (cluster flags {1, 2}). The positive label is the Foldseek cluster. This provides large-scale structural supervision, forcing the embedding space to capture three-dimensional structural relationships. These are unsupervised, predicted clusters (unlike curated SCOP families).
#### StringDB interaction pairs\.
Protein–protein interaction pairs from the STRING database (Szklarczyk et al., [2023](https://arxiv.org/html/2605.06830#bib.bib8)), filtered to combined_score ≥ 400 (medium confidence). These capture functional association, co-occurrence in pathways, complexes, or co-expression, providing a distinct biological relationship axis from homology or structural similarity. Sequences with over 50% sequence identity (by MMseqs2 linclust) to any protein in the Bernett PPI benchmark test set were removed from the StringDB data to prevent data leakage. Remaining sequences were globally deduplicated to 50% identity and filtered to lengths [10, 1024].
#### DMS fitness data\.
ProteinGym DMS and clinical substitution and indel scores are per-assay z-scored, clipped to $[-3, 3]$, and rescaled to $[0, 1]$; clinical labels are mapped Pathogenic → 0, Benign → 1. This auxiliary loss operates on single proteins rather than pairs. Assay families that overlap downstream evaluation (GB1 and GFP variants) are dropped, and the supervised-benchmark test fold is removed using its explicit split metadata where present, or the same deterministic per-group 80/20 split otherwise, so no DMS row used in training is later evaluated as supervised regression.
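A sketch of the per-assay normalization described above, assuming a pandas DataFrame with ProteinGym-style DMS_id and DMS_score columns (the released preprocessing script is the authoritative version):

```python
import pandas as pd

def normalize_dms(df: pd.DataFrame) -> pd.DataFrame:
    """Per-assay z-score, clip to [-3, 3], rescale to [0, 1]."""
    def per_assay(scores: pd.Series) -> pd.Series:
        z = (scores - scores.mean()) / scores.std()
        return (z.clip(-3.0, 3.0) + 3.0) / 6.0  # map [-3, 3] onto [0, 1]

    out = df.copy()
    out["label"] = out.groupby("DMS_id")["DMS_score"].transform(per_assay)
    return out
```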
#### Leakage controls\.
The primary risk of train–test overlap arises from the Pfam and AlphaFold DB datasets\. For Pfam, training pairs are sampled from family membership labels, which do not overlap with the held\-out evaluation splits used for individual downstream tasks \(e\.g\., the remote homology fold\-level evaluation uses a disjoint fold partition, not family\-level labels\)\. For AlphaFold, training uses Foldseek cluster co\-membership rather than SCOPe labels; we do not filter AFDB sequences against SCOPe test domains, so partial sequence overlap is possible\. We note this as a limitation: while the training labels \(Foldseek clusters\) and evaluation labels \(SCOPe superfamilies\) are drawn from different classification systems, the underlying sequences may overlap\. For DMS data we exclude ProteinGym from our evaluation due to overlap\. For the remaining evaluation tasks \(solubility, signal peptide, etc\.\), the training data provides only pairwise relationship labels, not task\-specific annotations\.
### 3.4 Training objective
The primary contrastive objective is MultipleNegativesRankingLoss (MNRL) (Henderson et al., [2017](https://arxiv.org/html/2605.06830#bib.bib4)), defined over mini-batches of positive pairs. For a batch of $N$ anchor–positive pairs $\{(a_i, p_i)\}_{i=1}^{N}$, the loss treats all other positives in the batch as negatives:

$$\mathcal{L}_{\text{MNRL}} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{\exp(\text{sim}(f_\theta(a_i), f_\theta(p_i))/\tau)}{\sum_{j=1}^{N}\exp(\text{sim}(f_\theta(a_i), f_\theta(p_j))/\tau)} \quad (1)$$

where $\tau$ is a temperature parameter (we use $\tau = 0.05$, equivalently a scale factor of 20) and $\text{sim}(\cdot,\cdot)$ denotes cosine similarity.
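A from-scratch sketch of Eq. (1) in PyTorch, assuming the anchor and positive embeddings for a batch have already been computed (SentenceTransformers provides an equivalent MultipleNegativesRankingLoss; this version is only for illustration):

```python
import torch
import torch.nn.functional as F

def mnrl(anchors: torch.Tensor, positives: torch.Tensor, tau: float = 0.05) -> torch.Tensor:
    """Eq. (1): in-batch softmax over cosine similarities; other anchors' positives act as negatives."""
    a = F.normalize(anchors, dim=-1)    # (N, d)
    p = F.normalize(positives, dim=-1)  # (N, d)
    logits = a @ p.T / tau              # (N, N) cosine similarities scaled by 1/tau
    targets = torch.arange(a.size(0), device=a.device)  # the diagonal holds the true pairs
    return F.cross_entropy(logits, targets)
```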
For the Pfam hard negatives dataset, explicit negative columns are appended to the batch, providing additional challenging negatives beyond the in\-batch negatives\.
The auxiliary DMS loss uses CoSENTLoss (Su, [2022](https://arxiv.org/html/2605.06830#bib.bib5)), a ranking-based objective that preserves the ordering of continuous fitness scores among protein variants. The DMS dataset is included as an additional entry in the round-robin sampler (Section [3.5](https://arxiv.org/html/2605.06830#S3.SS5)), receiving equal per-step weight with the four MNRL datasets. At each training step, the sampler draws a batch from exactly one dataset; thus the DMS loss contributes to approximately one-fifth of all gradient updates rather than being weighted by an explicit coefficient $\lambda$.
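For reference, the CoSENT objective of Su ([2022](https://arxiv.org/html/2605.06830#bib.bib5)) can be written as follows (a standard rendering with notation adapted to embedding pairs $(u, v)$ and gold scores $y$, not reproduced from the paper itself):

$$\mathcal{L}_{\text{CoSENT}} = \log\Bigl(1 + \sum_{y_{ij} > y_{kl}} \exp\bigl(\lambda\,(\cos(u_k, v_l) - \cos(u_i, v_j))\bigr)\Bigr),$$

which penalizes every pair of examples whose predicted cosine similarities are ordered inconsistently with their gold scores; $\lambda$ is a scale hyperparameter.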
### 3.5 Multi-dataset sampling
The datasets are combined using round\-robin sampling: at each training step, the dataloader draws a batch from the next dataset in a cyclic order\. This ensures equal representation of each biological signal source regardless of dataset size, preventing the largest dataset from dominating the gradient updates\.
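A condensed sketch of how the five datasets and their losses can be wired together with the SentenceTransformers v3 multi-dataset trainer (the parquet names follow Appendix 12; the `model` object is the wrapped backbone from the Section 3.2 sketch, and column handling and hyperparameters are placeholders rather than the released recipe):

```python
from datasets import load_dataset
from sentence_transformers import SentenceTransformerTrainer, losses
from sentence_transformers.training_args import (
    SentenceTransformerTrainingArguments,
    MultiDatasetBatchSamplers,
)

files = {
    "pfam": "pfam_sorted.parquet",
    "pfam_hard_neg": "pfam_hard_negatives.parquet",
    "afdb": "afdb_sorted.parquet",
    "string": "stringdb_train.parquet",
    "dms": "dms_cosent.parquet",
}
train = {name: load_dataset("parquet", data_files=path, split="train")
         for name, path in files.items()}

# Four MNRL sources plus the CoSENT-supervised DMS source.
loss = {name: losses.MultipleNegativesRankingLoss(model, scale=20.0) for name in files}
loss["dms"] = losses.CoSENTLoss(model)

args = SentenceTransformerTrainingArguments(
    output_dir="protsent",
    # Each step draws a batch from exactly one dataset, cycling through them in order.
    multi_dataset_batch_sampler=MultiDatasetBatchSamplers.ROUND_ROBIN,
)
trainer = SentenceTransformerTrainer(model=model, args=args, train_dataset=train, loss=loss)
trainer.train()
```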
### 3.6 Training configuration
Models were trained on a single NVIDIA RTX 6000 Ada GPU \(48 GB\)\. We use AdamW with a cosine learning rate schedule, 0\.1 dropout and an effective batch size of 1024\. Up to 70M training pairs are generated from the source datasets\. Full hyperparameters are listed in Table[6](https://arxiv.org/html/2605.06830#S8.T6)\(Appendix[8](https://arxiv.org/html/2605.06830#S8)\)\.
## 4 Experiments
### 4.1 Evaluation protocol
We evaluate all models using a frozen embedding protocol: the fine-tuned (or baseline) model encodes each protein sequence into a fixed-length vector, and a $k$-nearest-neighbor (KNN) probe with $k = 3$ and Euclidean distance is trained on the resulting embeddings. Although training optimizes cosine similarity, for approximately L2-normalized embeddings (as produced by contrastive training), Euclidean distance and cosine distance induce the same neighbor ranking: $\|a - b\|^2 = 2(1 - \cos(a, b))$ when $\|a\| = \|b\| = 1$. We use the scikit-learn default (Euclidean) for compatibility; the SCOPe-40 retrieval evaluation (Section [5.2](https://arxiv.org/html/2605.06830#S5.SS2)) uses cosine distance and shows similar results.
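A sketch of the probe for a binary classification task, assuming precomputed train/test embedding matrices and labels (X_train, y_train, X_test, y_test are placeholders; regression tasks would use KNeighborsRegressor and Spearman correlation analogously):

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import roc_auc_score

# X_train, X_test: frozen protein embeddings; y_train, y_test: task labels.
probe = KNeighborsClassifier(n_neighbors=3)   # k = 3; Euclidean distance is the sklearn default
probe.fit(X_train, y_train)
scores = probe.predict_proba(X_test)[:, 1]    # positive-class probability for the AUC metric
print("AUC:", roc_auc_score(y_test, scores))
```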
For binary classification tasks, we report Area Under the ROC Curve (AUC). For multiclass tasks, we report macro-averaged F1 score (or AUC, as indicated in the results tables). For regression tasks, we report Spearman rank correlation. Evaluation uses a held-out test split where available, with 4-fold cross-validation on the training set as a fallback.
### 4.2 Benchmark tasks
We evaluate on 23 tasks spanning three types:
#### Binary classification \(8 tasks\)\.
Protein–protein interaction prediction (Bernett Gold Standard PPI dataset), solubility prediction (DeepSol (Khurana et al., [2018](https://arxiv.org/html/2605.06830#bib.bib9))), peptide-HLA binding, metal ion binding, signal peptide prediction (SignalP (Teufel et al., [2022](https://arxiv.org/html/2605.06830#bib.bib10))), neuropeptide precursor prediction (NeuroPID (Ofer and Linial, [2015](https://arxiv.org/html/2605.06830#bib.bib11))), binary subcellular localization, and material production classification.
#### Multiclass classification \(5 tasks\)\.
Remote homology detection at the fold level, enzyme commission \(EC\) number classification, subcellular localization \(10\-class\), antibiotic resistance mechanism classification, and temperature stability classification \(thermophile/mesophile/psychrophile\)\.
#### Regression \(10 tasks\)\.
Variant effect prediction (GB1 fitness landscape), fluorescence prediction (TAPE benchmark (Rao et al., [2019](https://arxiv.org/html/2605.06830#bib.bib12))), protein stability (Biomap), thermostability (FLIP benchmark (Dallago et al., [2021](https://arxiv.org/html/2605.06830#bib.bib13))), optimal pH prediction, enzyme catalytic efficiency (kcat), cloning classification, beta-lactamase fitness (PEER benchmark), AAV capsid fitness (FLIP), and RhlA enzyme mutation effects.
### 4.3 SCOPe-40 structural retrieval
In addition to the probe-based evaluation, we evaluate embedding quality through a retrieval task on the SCOPe-40 structural classification database. For each query protein, we retrieve the nearest neighbors by cosine similarity in embedding space and measure Recall@$K$ for $K \in \{1, 10, 30\}$, where a retrieval is considered correct if the retrieved protein belongs to the same structural superfamily as the query.
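A sketch of this metric, assuming L2-normalized embeddings and superfamily labels as NumPy arrays (a hypothetical helper, not the released evaluation code; for large databases the similarity matrix would be computed in chunks):

```python
import numpy as np

def recall_at_k(emb: np.ndarray, superfamily: np.ndarray, k: int) -> float:
    """Fraction of queries with a same-superfamily protein among the top-k cosine neighbors."""
    sims = emb @ emb.T                        # cosine similarity for L2-normalized rows
    np.fill_diagonal(sims, -np.inf)           # exclude the query itself
    topk = np.argsort(-sims, axis=1)[:, :k]   # indices of the k nearest neighbors per query
    hits = (superfamily[topk] == superfamily[:, None]).any(axis=1)
    return float(hits.mean())

# recall_at_k(embeddings, labels, k=1); likewise for k=10 and k=30
```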
## 5 Results
### 5.1 KNN probe evaluation
Table [2](https://arxiv.org/html/2605.06830#S5.T2) presents the KNN probe evaluation across all 23 tasks for both the 35M and 150M backbones. ProtSent improves 16 of 23 tasks at the 35M scale and 15 of 23 at 150M, with the largest gains on tasks that depend on structural or functional grouping in embedding space.
At the 35M scale, the strongest improvements appear on remote homology detection (+40.5%), RhlA enzyme mutations (+77.2%), beta-lactamase fitness (+18.5%), and fluorescence (+15.6%). These tasks require the model to group proteins by structural fold or mutational phenotype, which is precisely the relationship that contrastive training on Pfam families and AlphaFold structural pairs is designed to capture. Binary classification tasks improve broadly: PPI (+5.3%), peptide-HLA binding (+3.6%), and solubility (+2.7%).
The 150M backbone exhibits the same pattern with several notable differences. Remote homology detection reaches +105%, and EC classification improves by +15.9% (vs. +0.3% at 35M), showing that the larger backbone provides enough capacity for contrastive training to separate enzyme classes. GB1 variant effect prediction improves by +17.3%, whereas the 35M model showed a slight decrease (−0.8%), indicating that capturing mutational fitness relationships benefits from additional model capacity. Conversely, RhlA degrades at 150M (−27.1%) despite its large gain at 35M, and beta-lactamase shows only +0.1% at 150M vs. +18.5% at 35M.
Stability and thermostability regression degrade modestly at both scales, possibly because the sequence-level representation does not specifically preserve ordinal relationships within a protein's mutational neighborhood in the case of point mutations.
Table 2: KNN probe evaluation on 23 downstream tasks: baseline ESM-2 vs. ProtSent at two model scales (35M and 150M parameters). Δ% = 100 × (Trained − Baseline) / Baseline. Tasks are grouped by type.

| Task | Metric | 35M Base | 35M ProtSent | 35M Δ% | 150M Base | 150M ProtSent | 150M Δ% |
| --- | --- | --- | --- | --- | --- | --- | --- |
| *Multiclass classification* | | | | | | | |
| Remote Homology (Fold) | F1M | .223 | .313 | +40.5 | .190 | .390 | +105.0 |
| EC Classification | F1M | .450 | .452 | +0.3 | .408 | .473 | +15.9 |
| Subcellular Localisation | AUC | .784 | .794 | +1.3 | .797 | .813 | +1.9 |
| Antibiotic Resistance | AUC | .785 | .786 | +0.1 | .786 | .790 | +0.6 |
| Temperature Stability | F1M | .896 | .873 | −2.6 | .906 | .881 | −2.8 |
| *Binary classification* | | | | | | | |
| PPI (Bernett) | AUC | .560 | .589 | +5.3 | .556 | .592 | +6.4 |
| Peptide-HLA Binding | AUC | .748 | .775 | +3.6 | .749 | .772 | +3.1 |
| Solubility (DeepSol) | AUC | .692 | .711 | +2.7 | .712 | .719 | +1.1 |
| Neuropeptide (NeuroPID) | AUC | .948 | .956 | +0.9 | .963 | .945 | −1.8 |
| Binary Subcel. Loc. | AUC | .925 | .932 | +0.8 | .936 | .922 | −1.5 |
| Signal Peptide | AUC | .965 | .971 | +0.7 | .971 | .972 | +0.1 |
| Metal Ion Binding | AUC | .825 | .830 | +0.5 | .808 | .843 | +4.3 |
| Material Production | AUC | .755 | .758 | +0.4 | .754 | .759 | +0.6 |
| *Regression* | | | | | | | |
| RhlA Enzyme Mutations | Spr. | .236 | .418 | +77.2 | .345 | .252 | −27.1 |
| Beta-lactamase (PEER) | Spr. | .670 | .793 | +18.5 | .732 | .733 | +0.1 |
| Fluorescence (TAPE) | Spr. | .490 | .567 | +15.6 | .504 | .569 | +12.7 |
| Variant Effect (GB1) | Spr. | .656 | .651 | −0.8 | .670 | .785 | +17.3 |
| Optimal pH | Spr. | .498 | .514 | +3.1 | .509 | .512 | +0.6 |
| AAV Fitness (FLIP) | Spr. | .742 | .729 | −1.7 | .706 | .725 | +2.6 |
| Cloning Classification | Spr. | .398 | .394 | −1.0 | .386 | .389 | +0.8 |
| Enzyme Cat. Efficiency | Spr. | .705 | .687 | −2.6 | .705 | .699 | −0.9 |
| Stability (Biomap) | Spr. | .568 | .547 | −3.7 | .588 | .569 | −3.3 |
| Thermostability (FLIP) | Spr. | .475 | .449 | −5.3 | .467 | .460 | −1.4 |
| Tasks improved | | | 16 / 23 | | | 15 / 23 | |
### 5.2 SCOPe-40 structural retrieval
Table [3](https://arxiv.org/html/2605.06830#S5.T3) reports retrieval performance on the SCOPe-40 structural classification benchmark for both model scales. ProtSent achieves consistent double-digit improvements across all recall thresholds, demonstrating that contrastive fine-tuning meaningfully reorganizes the embedding space to place structurally similar proteins closer together.
Table 3: SCOPe-40 structural retrieval results. Recall@$K$ measures the fraction of queries for which a protein from the same structural superfamily appears among the top-$K$ nearest neighbors.
The improvements are similar across both model scales and all recall thresholds. The 150M model shows even larger gains (+19.9% at Recall@1) than the 35M (+15.5%), indicating the larger backbone provides more capacity for contrastive fine-tuning to reorganize the embedding space. The single nearest neighbor in the ProtSent embedding space is substantially more likely to share the query protein's structural fold compared to the baseline ESM-2 representation.
### 5.3 Ablation studies
We conduct six ablation experiments on the ESM-2 35M backbone. Each ablation removes one data source or replaces round-robin with proportional sampling, retraining from scratch with otherwise identical hyperparameters.
Table 4: Ablation study results (ESM-2 35M, KNN probe). Each row shows the number of tasks improved (out of 23) and the mean relative change (%) across all tasks, both measured against the stock ESM-2 baseline. The full model row corresponds to the default ProtSent configuration.
Removing Pfam family pairs causes the largest degradation, reducing the number of improved tasks from 16 to 15 and the mean improvement from +6.7% to +4.6% (Table [4](https://arxiv.org/html/2605.06830#S5.T4)). Individual tasks that depend on evolutionary homology are most affected: beta-lactamase fitness drops −8.8% and RhlA enzyme mutations drops −31.5% relative to the full model. This confirms that Pfam provides the dominant contrastive signal.
Training without the hard negatives dataset yields a surprising result: 20 of 23 tasks improve over the baseline, compared to 16 for the full model, with a higher mean delta (+7.9% vs. +6.7%) (Table [7](https://arxiv.org/html/2605.06830#S11.T7)). This suggests that explicit hard negative mining, while potentially useful for fine-grained discrimination, may introduce overly aggressive contrastive gradients that perturb the embedding space on some tasks. EC classification improves by +7.4% relative to the full model when hard negatives are removed. We hypothesize that the in-batch negatives alone provide sufficient contrastive signal for most downstream tasks, and that hard negative selection warrants further investigation.
Removing AlphaFold DB structural pairs causes the second-largest degradation after Pfam: the number of improved tasks drops from 16 to 13 and the mean delta falls from +6.7% to +3.2%. EC classification is hit hardest (−11.0% vs. baseline), and remote homology drops to +15.3% (from +40.5% with the full model), confirming that structural supervision from Foldseek clusters provides a complementary signal that Pfam family labels alone do not fully capture. Removing StringDB has a milder effect: 17 tasks still improve and the mean delta matches the full model (+5.9%). PPI prediction drops to −0.5% (from +5.3%), confirming that StringDB interaction pairs drive the PPI improvement. However, the overall embedding quality remains largely intact, suggesting the other three data sources provide sufficient contrastive signal for most tasks.
Replacing round-robin with proportional sampling yields 16 improved tasks with a mean delta of +7.0%, comparable to the full model. The modest difference indicates that training is relatively robust to the sampling strategy, though individual task variation exists (e.g., optimal pH drops −9.4% vs. the full model under proportional sampling).
Removing the DMS CoSENT loss modestly reduces performance (15/23 tasks, +5.8% vs. 16/23, +6.7%). Fitness-regression tasks are most affected: fluorescence drops from +15.6% to +10.4% and thermostability degrades further. The auxiliary DMS signal thus provides a small but steady benefit to fitness-landscape tasks without harming broader embedding quality. Detailed per-task changes for each ablation are provided in the appendix. These are single-factor ablations; interactions between components remain unexplored and could yield additional insights.
### 5.4 Few-shot evaluation
A key motivation for better embedding neighborhoods is improved performance under label scarcity. We evaluate ProtSent in a few-shot setting by subsampling the training set to $N \in \{50, 100, 500, 1000\}$ labeled examples per task and evaluating with the KNN probe ($k = 3$). Table [5](https://arxiv.org/html/2605.06830#S5.T5) reports the relative improvement (Δ%) of ProtSent over the baseline ESM-2 35M at each sample budget.
Table 5: Few-shot evaluation (ESM-2 35M, KNN probe). Relative improvement (Δ%) of ProtSent over baseline at each sample size.
At $N \geq 100$, ProtSent improves 10 of 14 evaluable tasks. The largest gains appear on tasks that require grouping by structural or functional similarity: remote homology detection (+244.5% at $N = 100$, +90.2% at $N = 1000$), subcellular localisation (+5.6–29.5%), and EC classification (+14.1–64.9%). These tasks benefit most from improved neighborhood structure because the KNN probe relies on finding correctly labeled neighbors in a small training set. At $N = 50$, results are noisy: with $k = 3$ and only 50 training points, the probe is highly sensitive to stochastic neighbor selection, and several regression tasks show large negative deltas that stabilize at higher budgets. Notably, tasks where ProtSent degrades in the few-shot regime (Variant Effect, Stability) are the same tasks that show modest degradation with the full dataset, possibly representing a trade-off of the contrastive objective rather than a few-shot-specific failure mode.
## 6 Discussion
#### Neighborhood structure as the primary improvement axis\.
The consistent pattern across both model scales is that tasks requiring accurate embedding neighborhoods (remote homology, structural retrieval, fluorescence prediction) show the largest improvements. Contrastive training optimizes angular relationships between embeddings, restructuring neighborhoods to reflect the biological similarities encoded in the training data. Tasks where the baseline already performed well (e.g., signal peptide, temperature stability) show smaller or negative changes, hinting that the global reorganization of the space can disrupt incidental structure that the pretrained model had learned.
#### Task\-specific analysis\.
The +105% improvement on remote homology detection (150M, KNN) and +40.5% (35M, KNN) is the most dramatic result. Fold-level homology requires recognizing structural similarity between proteins with minimal sequence identity, precisely the capability that contrastive training on Pfam families and AlphaFold structural pairs should confer. The improvement on PPI prediction (+5–6% across scales) confirms that interaction partners are brought closer in embedding space by training on StringDB pairs.
#### Regression tasks\.
Fluorescence and beta-lactamase fitness improve at both scales, indicating that contrastive training preserves local manifold structure relevant to mutational effects. GB1 variant effect prediction presents a scale-dependent result: −0.8% at 35M but +17.3% at 150M, indicating the larger backbone allows the contrastive objective to capture fitness landscape relationships without disrupting other properties. Some regression tasks (thermostability, stability) degrade at both scales, possibly because the contrastive objective does not specifically preserve ordinal relationships within a single protein's mutational neighborhood. ProtSent is not designed to compete with specialized variant-effect prediction methods that use supervised heads or retrieval-augmented inference; instead, the VEP results characterize a side effect of general-purpose embedding improvement on a task that was not directly optimized.
#### Limitations\.
The training data is also limited to the relationship types captured by the five source datasets; protein relationships not represented (e.g., enzymatic mechanism similarity, expression patterns) are not specifically optimized. We do not compare to specialized retrieval systems (ProtTucker, PLMSearch, DHR) on matched benchmarks; our goal is general-purpose embeddings rather than retrieval-optimized models, but such comparisons would help quantify the generality–accuracy trade-off. Finally, all results are from single training runs without multi-seed uncertainty estimates; the few-shot results in particular show high variance at small $N$, and confidence intervals would strengthen the robustness claims.
## 7 Conclusion
We presented ProtSent, a contrastive fine\-tuning framework that transforms protein language model representations into function\-aware embeddings where proximity reflects biological relatedness\. By training with multiple data sources and round\-robin sampling, ProtSent achieves substantial improvements on tasks that depend on embedding neighborhood quality\. The uniform improvements across two model scales suggest that contrastive fine\-tuning is a broadly applicable strategy for adapting pretrained protein language models to produce embeddings suitable for retrieval, clustering, and few\-shot transfer\.
## References
- I\. Barrio\-Hernandez, J\. Yeo, J\. Jänes, M\. Mirdita, C\. L\. M\. Gilchrist, T\. Wein, M\. Varadi, S\. Velankar, P\. Beltrao, and M\. Steinegger \(2023\)Clustering predicted structures at the scale of the known protein universe\.Nature622\(7983\),pp\. 637–645\(en\)\.External Links:ISSN 1476\-4687,[Link](https://www.nature.com/articles/s41586-023-06510-w),[Document](https://dx.doi.org/10.1038/s41586-023-06510-w)Cited by:[§3\.3](https://arxiv.org/html/2605.06830#S3.SS3.SSS0.Px3.p1.2)\.
- C\. Dallago, J\. Mou, K\. E\. Johnston, B\. J\. Wittmann, N\. Bhatt, D\. Goldman, A\. Sadler, Z\. Wang,et al\.\(2021\)FLIP: benchmark tasks in fitness landscape inference for proteins\.bioRxiv\.Cited by:[§4\.2](https://arxiv.org/html/2605.06830#S4.SS2.SSS0.Px3.p1.1),[§9](https://arxiv.org/html/2605.06830#S9.SS0.SSS0.Px4.p1.1)\.
- A\. Elnaggar, M\. Heinzinger, C\. Dallago, G\. Rehawi, Y\. Wang, L\. Jones, T\. Gibbs, T\. Feher, C\. Angerer, M\. Steinegger, D\. Bhowmik, and B\. Rost \(2022\)ProtTrans: toward understanding the language of life through self\-supervised learning\.IEEE Transactions on Pattern Analysis and Machine Intelligence44\(10\),pp\. 7112–7127\.Cited by:[§2](https://arxiv.org/html/2605.06830#S2.SS0.SSS0.Px1.p1.1)\.
- M\. Heinzinger, M\. Littmann, I\. Sillitoe, C\. A\. Orengo, and B\. Rost \(2022\)Contrastive learning on protein embeddings enlightens midnight zone\.NAR Genomics and Bioinformatics4\(2\),pp\. lqac043\.Cited by:[§2](https://arxiv.org/html/2605.06830#S2.SS0.SSS0.Px2.p1.1)\.
- M\. Henderson, R\. Al\-Rfou, B\. Strope, Y\. Sung, L\. Lukács, R\. Guo, S\. Kumar, B\. Miklos, and R\. Kurzweil \(2017\)Efficient natural language response suggestion for smart reply\.InarXiv preprint arXiv:1705\.00652,Cited by:[§1](https://arxiv.org/html/2605.06830#S1.p2.1),[§3\.4](https://arxiv.org/html/2605.06830#S3.SS4.p1.2)\.
- L\. Hong, S\. Sun, L\. Li, H\. Tian, M\. Li,et al\.\(2024\)Dense homolog retrieval for protein function and structure prediction\.Nature Biotechnology\.Cited by:[§2](https://arxiv.org/html/2605.06830#S2.SS0.SSS0.Px3.p1.1)\.
- Y\. Jiang, T\. R\. Oron, W\. T\. Clark,et al\.\(2016\)An expanded evaluation of protein function prediction methods shows an improvement in accuracy\.Genome Biology17\(1\)\.Note:arXiv: 1601\.00891 Genre: Quantitative MethodsExternal Links:ISSN 1474760X,[Link](http://arxiv.org/abs/1601.00891),[Document](https://dx.doi.org/10.1186/s13059-016-1037-6)Cited by:[§2](https://arxiv.org/html/2605.06830#S2.SS0.SSS0.Px4.p1.1)\.
- S\. Khurana, R\. Rawi, K\. Kuber, S\. Patchber, W\. Bai, M\. R\. Garvin, T\. Ideker, W\. Zhang, S\. Doerr, N\. Guilhot,et al\.\(2018\)DeepSol: a deep learning framework for sequence\-based protein solubility prediction\.Bioinformatics34\(15\),pp\. 2605–2613\.Cited by:[§4\.2](https://arxiv.org/html/2605.06830#S4.SS2.SSS0.Px1.p1.1)\.
- Z\. Lin, H\. Akin, R\. Rao, B\. Hie, Z\. Zhu, W\. Lu, N\. Smetanin, R\. Verkuil, O\. Kabeli, Y\. Shmueli,et al\.\(2023\)Evolutionary\-scale prediction of atomic\-level protein structure with a language model\.Science379\(6637\),pp\. 1123–1130\.Cited by:[§2](https://arxiv.org/html/2605.06830#S2.SS0.SSS0.Px1.p1.1),[§3\.2](https://arxiv.org/html/2605.06830#S3.SS2.p1.1)\.
- Z\. Lin, C\. Xu,et al\.\(2025\)GOBeacon: gene ontology prediction with ensemble learning and contrastive regularization\.Protein Science34\.Cited by:[§2](https://arxiv.org/html/2605.06830#S2.SS0.SSS0.Px4.p1.1)\.
- W\. Liu, Z\. Wang, R\. You, C\. Xie, H\. Wei, Y\. Xiong, J\. Yang, and S\. Zhu \(2024\)PLMSearch: protein language model powers accurate and fast sequence search for remote homology\.Nature Communications15\(1\),pp\. 2775\.Cited by:[§2](https://arxiv.org/html/2605.06830#S2.SS0.SSS0.Px3.p1.1)\.
- T\. Michael\-Pitschaze, N\. Cohen, D\. Ofer, Y\. Hoshen, and M\. Linial \(2024\)Detecting anomalous proteins using deep representations\.NAR Genomics and Bioinformatics6\(1\),pp\. lqae021\.External Links:ISSN 2631\-9268,[Link](https://doi.org/10.1093/nargab/lqae021),[Document](https://dx.doi.org/10.1093/nargab/lqae021)Cited by:[§2](https://arxiv.org/html/2605.06830#S2.SS0.SSS0.Px4.p1.1)\.
- J\. Mistry, S\. Chuguransky, L\. Williams, M\. Qureshi, G\. A\. Salazar, E\. L\. Sonnhammer, S\. C\. Tosatto, L\. Paladin, S\. Raj, L\. J\. Richardson,et al\.\(2021\)Pfam: the protein families database in 2021\.Nucleic Acids Research49\(D1\),pp\. D99–D105\.Cited by:[§3\.3](https://arxiv.org/html/2605.06830#S3.SS3.SSS0.Px1.p1.1)\.
- D\. Ofer and M\. Linial \(2015\)ProFET: feature engineering captures high\-level protein functions\.Bioinformatics31\(21\),pp\. 3429–3436\.Cited by:[§4\.2](https://arxiv.org/html/2605.06830#S4.SS2.SSS0.Px1.p1.1)\.
- D\. Ofer and M\. Linial \(2025\)Protein Language Models Expose Viral Immune Mimicry\.Viruses17\(9\) \(en\)\.External Links:ISSN 1999\-4915,[Link](https://www.mdpi.com/1999-4915/17/9/1199),[Document](https://dx.doi.org/10.3390/v17091199)Cited by:[§2](https://arxiv.org/html/2605.06830#S2.SS0.SSS0.Px4.p1.1)\.
- R\. Rao, N\. Bhatt, A\. Lu, M\. C\. Cowperthwaite, P\. A\. Romero, and A\. Zhong \(2019\)Evaluating protein transfer learning with TAPE\.Advances in Neural Information Processing Systems32\.Cited by:[§2](https://arxiv.org/html/2605.06830#S2.SS0.SSS0.Px1.p1.1),[§4\.2](https://arxiv.org/html/2605.06830#S4.SS2.SSS0.Px3.p1.1),[§9](https://arxiv.org/html/2605.06830#S9.SS0.SSS0.Px4.p1.1)\.
- I\. Redl, R\. Lunkad, C\. Genis\-Chalamanch, S\. Bottaro, H\. Penedones, and O\. Michielin \(2023\)Optimizing protein language models with Sentence Transformers\.InNeurIPS 2023 Workshop on Machine Learning for Structural Biology,Cited by:[§2](https://arxiv.org/html/2605.06830#S2.SS0.SSS0.Px2.p1.1)\.
- N\. Reimers and I\. Gurevych \(2019\)Sentence\-BERT: sentence embeddings using siamese BERT\-networks\.InProceedings of the 2019 Conference on Empirical Methods in Natural Language Processing \(EMNLP\),pp\. 3982–3992\.Cited by:[§1](https://arxiv.org/html/2605.06830#S1.p1.1),[§2](https://arxiv.org/html/2605.06830#S2.SS0.SSS0.Px2.p1.1),[§3\.2](https://arxiv.org/html/2605.06830#S3.SS2.p1.1),[§8](https://arxiv.org/html/2605.06830#S8.p1.1)\.
- N\. Reimers and I\. Gurevych \(2020\)Making monolingual sentence embeddings multilingual using knowledge distillation\.InProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing \(EMNLP\),pp\. 4512–4525\.Cited by:[§2](https://arxiv.org/html/2605.06830#S2.SS0.SSS0.Px2.p1.1)\.
- A\. Rives, J\. Meier, T\. Sercu, S\. Goyal, Z\. Lin, J\. Liu, D\. Guo, M\. Ott, C\. L\. Zitnick, J\. Ma, and R\. Fergus \(2021\)Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences\.Proceedings of the National Academy of Sciences118\(15\),pp\. e2016239118\.Cited by:[§2](https://arxiv.org/html/2605.06830#S2.SS0.SSS0.Px1.p1.1),[§2](https://arxiv.org/html/2605.06830#S2.SS0.SSS0.Px4.p1.1)\.
- J\. Söding \(2005\)Protein homology detection by HMM\-HMM comparison\.Bioinformatics21\(7\),pp\. 951–960\.Note:ISBN: 1367\-4803 \(Print\)\\r1367\-4803 \(Linking\)External Links:ISSN 13674803,[Document](https://dx.doi.org/10.1093/bioinformatics/bti125)Cited by:[§2](https://arxiv.org/html/2605.06830#S2.SS0.SSS0.Px3.p1.1)\.
- M\. Steinegger and J\. Söding \(2017\)MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets\.Nature Biotechnology35\(11\),pp\. 1026–1028\(en\)\.External Links:ISSN 1087\-0156, 1546\-1696,[Link](http://www.nature.com/articles/nbt.3988),[Document](https://dx.doi.org/10.1038/nbt.3988)Cited by:[§2](https://arxiv.org/html/2605.06830#S2.SS0.SSS0.Px3.p1.1)\.
- J\. Su \(2022\)CoSENT: a more efficient sentence vector scheme than sentence\-bert\.Blog post\.Note:[https://kexue\.fm/archives/8847](https://kexue.fm/archives/8847)Cited by:[§1](https://arxiv.org/html/2605.06830#S1.p2.1),[§3\.4](https://arxiv.org/html/2605.06830#S3.SS4.p5.1)\.
- B\. E\. Suzek, Y\. Wang, H\. Huang, P\. B\. McGarvey, and C\. H\. Wu \(2015\)UniRef clusters: a comprehensive and scalable alternative for improving sequence similarity searches\.\.Bioinformatics \(Oxford, England\)31\(6\),pp\. 926–32\.External Links:ISSN 1367\-4811,[Link](http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=4375400&tool=pmcentrez&rendertype=abstract),[Document](https://dx.doi.org/10.1093/bioinformatics/btu739)Cited by:[§3\.2](https://arxiv.org/html/2605.06830#S3.SS2.p1.1)\.
- D\. Szklarczyk, R\. Kirsch, M\. Koutrouli, K\. Nastou, F\. Mehryary, R\. Hachilif, A\. L\. Gable, T\. Fang, N\. T\. Doncheva, S\. Pyysalo,et al\.\(2023\)The STRING database in 2023: protein–protein association networks and functional enrichment analyses for any sequenced genome of interest\.Nucleic Acids Research51\(D1\),pp\. D483–D489\.Cited by:[§3\.3](https://arxiv.org/html/2605.06830#S3.SS3.SSS0.Px4.p1.2)\.
- F\. Teufel, J\. J\. Almagro Armenteros, A\. R\. Johansen, M\. H\. Gíslason, S\. I\. Piber, K\. D\. Tsirigos, O\. Winther, S\. Brunak, G\. von Heijne, and H\. Nielsen \(2022\)SignalP 6\.0 predicts all five types of signal peptides using protein language models\.Nature Biotechnology40\(7\),pp\. 1023–1025\.Cited by:[§4\.2](https://arxiv.org/html/2605.06830#S4.SS2.SSS0.Px1.p1.1)\.
- M\. van Kempen, S\. S\. Kim, C\. Tumescheit, M\. Mirdita, J\. Lee, C\. L\. M\. Gilchrist, J\. Söding, and M\. Steinegger \(2024\)Fast and accurate protein structure search with Foldseek\.Nature Biotechnology42\(2\),pp\. 243–246\.Cited by:[§2](https://arxiv.org/html/2605.06830#S2.SS0.SSS0.Px3.p1.1),[§3\.3](https://arxiv.org/html/2605.06830#S3.SS3.SSS0.Px3.p1.2)\.
- M\. Varadi, S\. Anyango, M\. Deshpande, S\. Nair, C\. Natassia, G\. Yordanova, D\. Yuan, O\. Stroe, G\. Wood, A\. Laydon,et al\.\(2022\)AlphaFold Protein Structure Database: massively expanding the structural coverage of protein\-sequence space with high\-accuracy models\.Nucleic Acids Research50\(D1\),pp\. D439–D444\.Cited by:[§3\.3](https://arxiv.org/html/2605.06830#S3.SS3.SSS0.Px3.p1.2)\.
- T\. Wolf, L\. Debut, V\. Sanh, J\. Chaumond, C\. Delangue, A\. Moi, P\. Cistac, T\. Rault, R\. Louf, M\. Funtowicz,et al\.\(2020\)Transformers: state\-of\-the\-art natural language processing\.InProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations,pp\. 38–45\.Cited by:[§8](https://arxiv.org/html/2605.06830#S8.p1.1)\.
## 8 Additional training details
Training was conducted on NVIDIA RTX 6000 Ada GPUs (48 GB VRAM) on an HPC cluster. The 35M model trains in approximately 3–4 hours; the 150M model trains in approximately 1.3 days. We use the SentenceTransformers library [Reimers and Gurevych, [2019](https://arxiv.org/html/2605.06830#bib.bib3)] built on top of HuggingFace Transformers [Wolf et al., [2020](https://arxiv.org/html/2605.06830#bib.bib14)]. All models are trained for a single epoch over 70M generated pairs. Full hyperparameters are listed in Table [6](https://arxiv.org/html/2605.06830#S8.T6).
Table 6:Training hyperparameters for both model scales\.
## 9 Full benchmark task descriptions
Below we provide additional details for selected evaluation tasks\. Metrics and probe types for all 23 tasks are specified in Section[4\.1](https://arxiv.org/html/2605.06830#S4.SS1)and the results tables\.
#### Remote homology detection\.
Fold-level classification from the SCOPe database. Training and test sets are split by superfamily so that no superfamily appears in both; the model must recognize structural similarity across evolutionarily distant sequences. We report macro-averaged F1 across fold classes.
#### SCOPe\-40 structural retrieval\.
A retrieval task over SCOPe domains filtered at 40% sequence identity. For each query, we retrieve nearest neighbors by cosine similarity and evaluate Recall@$K$ for $K \in \{1, 10, 30\}$, where a hit is correct if it shares the query's structural superfamily. This task uses the full validation set (100,000 proteins) with no subsampling.
#### Variant effect prediction \(GB1\)\.
Regression on the GB1 protein fitness landscape, where each variant is scored by its experimentally measured binding affinity. We report Spearman rank correlation between predicted and true fitness values. The GB1 landscape contains ~150,000 single and multi-site variants.
#### Other tasks\.
The remaining tasks follow standard formulations from the TAPE [Rao et al., [2019](https://arxiv.org/html/2605.06830#bib.bib12)], FLIP [Dallago et al., [2021](https://arxiv.org/html/2605.06830#bib.bib13)], and PEER benchmark suites. Binary classification tasks (PPI, solubility, signal peptide, metal ion binding, peptide-HLA, neuropeptide, binary subcellular localization, material production) are evaluated by AUC. Multiclass tasks (EC number, subcellular localization, antibiotic resistance, temperature stability) use AUC or macro F1. Regression tasks (fluorescence, stability, thermostability, optimal pH, enzyme catalytic efficiency, cloning, beta-lactamase, AAV, RhlA) use Spearman correlation. All tasks use held-out validation splits where available, with 4-fold cross-validation as a fallback.
## 10 UMAP Visualization of Embedding Space
Figure [1](https://arxiv.org/html/2605.06830#S10.F1) shows UMAP projections of protein embeddings from the baseline ESM-2 and ProtSent models, computed on held-out SCOPe-40 domains colored by fold-level and superfamily-level labels, as well as on Pfam sequences colored by family. Across all three groupings and both model scales, the ProtSent embeddings exhibit visually tighter and more separated clusters compared to the baseline. The effect is most pronounced at the fold level, where baseline embeddings show substantial overlap between structural classes that is reduced after contrastive fine-tuning. We note that UMAP projections are sensitive to hyperparameters and do not constitute a quantitative evaluation; these visualizations are intended as a qualitative complement to the retrieval and probe results.
(a) SCOPe-40, fold level
(b) Pfam families
Figure 1: UMAP projections of baseline ESM-2 150M (left in each panel) vs. ProtSent 150M (right). Points are colored by the 10 most frequent groups; remaining groups in gray. SCOPe-40 at the superfamily level, and Pfam families.
## 11 Full Per-Task Ablation Results
Table[7](https://arxiv.org/html/2605.06830#S11.T7)provides the complete per\-task breakdown of all ablation experiments\. Each cell reports the relative change \(%\) of the ablated model compared to the stock ESM\-2 35M baseline, using a KNN probe \(except multilabel tasks which require a linear probe\)\. The summary row counts tasks improved and reports the mean relative delta across the 23 KNN\-evaluated tasks\.
Table 7:Per\-task ablation results \(ESM\-2 35M, KNN probe except multilabel tasks which use linear probe\)\. Each cell shows the relative change \(%\) vs\. the stock ESM\-2 baseline\. Bold indicates best configuration per task\.
## 12 Dataset construction details
Each training parquet is reproduced by python data_prep.py --dataset X at default flags (with one exception, noted below for AFDB).
### 12.1 Pfam (pfam_sorted.parquet)
Source. Pfam-A.fasta.gz and Pfam-A.clans.tsv.gz from ftp.ebi.ac.uk/pub/databases/Pfam/current_release. Headers parse to (domain_id, family_id, sequence) with the PFxxxxx version stripped.
Pipeline. (1) MMseqs2 easy-linclust --min-seq-id 0.7 --cov-mode 1 -c 0.8 on the raw Pfam-A FASTA (global asymmetric clustering; family members assigned to representatives from other families are dropped). (2) Left-join family → clan; orphan families inherit clan_id := family_id. (3) Drop singleton families. (4) Sort by (clan_id, family_id) so windowed slicing preserves clan-level diversity during training.
Final. 32,943,498 sequences, 26,796 families, 15,284 clans (input: ~62M raw Pfam-A FASTA records).
### 12.2 Pfam hard negatives (pfam_hard_negatives.parquet)
Source. pfam_sorted.parquet above plus Pfam-A.hmm.gz (parsed with pyhmmer).
PSSM. For each Pfam HMM, take match-state emissions $e_{i,a}$ and the pyhmmer amino-acid background $b_a$ (both clipped at $10^{-9}$), and set $S_{i,a} = \log_2 e_{i,a} - \log_2 b_a$.
Eligibility. Drop families whose median sequence length differs from the HMM model length by more than 10%. Per row: keep $5 < L < 1024$; cap at 100 sequences/family.
Sampling. For an anchor of length $L$, use the direct map $i \to i$ for $i < \min(L, \text{model\_len})$; compute $\Delta S_{i,a} = S_{i,a} - S_{i,\text{wt}(i)}$ and zero out self-substitutions and non-standard residues. The candidate pool is positions $(i, a)$ with $\Delta S_{i,a} < -1.0$. Rejection-sample $k$-tuples (per-family seed $42 + i$, up to 2,048 proposals per $k$) with $k_{\min} = \max(6, \lceil -16.0 / \min_i \Delta S_i^{(\min)} \rceil)$ and $k_{\max} \leq 50$, accepting the first proposal that satisfies $\sum_i \Delta S_i \leq -16.0$ and pairwise position spacing $\geq \max(6, L/8)$.
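A simplified sketch of this rejection-sampling step, assuming the per-family log-odds matrix S (model positions × 20 standard amino acids) and the anchor sequence are already in hand (thresholds follow the text; the released data_prep.py handles HMM parsing, seeding, and the per-k proposal schedule, which are simplified away here):

```python
import numpy as np

AA = "ACDEFGHIKLMNPQRSTVWY"

def sample_hard_negative(anchor: str, S: np.ndarray, rng: np.random.Generator,
                         max_proposals: int = 2048):
    L = min(len(anchor), S.shape[0])
    wt = np.array([AA.index(a) if a in AA else -1 for a in anchor[:L]])
    # Log-odds drop of each substitution relative to the wild-type residue at that position.
    dS = S[:L] - S[np.arange(L), np.clip(wt, 0, None)][:, None]
    dS[wt < 0] = 0.0                                   # ignore non-standard residues
    cand = np.argwhere(dS < -1.0)                      # candidate (position, amino acid) pairs
    spacing = max(6, L // 8)
    for _ in range(max_proposals):
        k = int(rng.integers(6, 51))                   # number of substitutions to attempt
        if len(cand) < k:
            return None
        picks = cand[rng.choice(len(cand), size=k, replace=False)]
        pos = np.sort(picks[:, 0])
        if np.any(np.diff(pos) < spacing):
            continue                                   # positions too close (or duplicated)
        if dS[picks[:, 0], picks[:, 1]].sum() > -16.0:
            continue                                   # not deleterious enough in total
        mutant = list(anchor)
        for i, a in picks:
            mutant[i] = AA[a]
        return "".join(mutant)
    return None                                        # no acceptable mutant found
```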
Final. 1,821,068 anchors over 24,047 families (of 26,796 in pfam_sorted, after the ±10% length filter and the per-family cap of 100); 1,210,230 anchors (66%) carry a non-null hard_negative.
### 12.3 AFDB Foldseek structural pairs (afdb_sorted.parquet)
Sources. (i) willdaspit/afdb_clustered_seqs (HuggingFace): AFDB sequences with their AFDB50 representative IDs, plus per-row plddt and fragment flags. (ii) 1-AFDBClusters-repId_entryId_cluFlag_taxId.tsv.gz from afdb-cluster.steineggerlab.workers.dev/v6/: the upstream Foldseek S-cluster mapping. We retain cluFlag ∈ {1, 2} (singleton and small-cluster flags are excluded).
Pipeline. (1) Lazy-scan the HF parquets, filter plddt > 70 and fragment == 0. (2) Inner-join on HF.repId == Steinegger.entry_id; carry the Foldseek representative as group_id and the AFDB50 representative as afdb50_cluster_id. (3) Pre-shuffle (seed 42) then sort by afdb50_cluster_id. (4) Run uncapped: we use --limit_gb 0 (the CLI default of 25 GB caps at ~50M rows).
Final. 133,856,004 sequences in 815,712 Foldseek structural clusters spanning 1,815,626 AFDB50 representatives (input: ~214M AFDB sequences upstream; the pLDDT > 70 + non-fragment filter and the cluFlag ∈ {1, 2} inner-join produce the drop).
Disambiguation. "AFDB", "AFDB50", and "AFDB Foldseek clusters" name three different objects: the full AlphaFold Database; AFDB sequences clustered at 50% identity (the AFDB50 release); and AFDB structural clusters from Foldseek over predicted structures (the v6 S-cluster file). The contrastive positive label is the third; the underlying sequence resource is the second.
### 12.4 STRING-DB PPI (stringdb_train.parquet)
Sources. STRING v12.0 protein.sequences.v12.0.fa.gz and protein.physical.links.full.v12.0.txt.gz from stringdb-downloads.org.
Pipeline. (1) Pre-filter the FASTA to proteins that appear in some link with combined_score ≥ 400 (STRING medium confidence; reduces ~59M → ~25M proteins). (2) Bernett decontamination: load the Synthyra/bernett_gold_ppi test split, write each test sequence under a BERNETT_* ID, run a single MMseqs2 easy-linclust at 50% identity / 80% target coverage on the union, and drop any STRING protein whose linclust cluster contains any BERNETT_* member. (3) Two-stage clustering of survivors (all --cov-mode 1): easy-linclust at 65%/85%, then cascaded cluster at 50%/75% on the stage-1 representatives, with mappings composed member → rep65 → rep50. (4) Pair construction: filter links by score, join both endpoints to rep-50 clusters, drop self-cluster edges, canonicalise unordered cluster pairs, sort by score descending, and deduplicate keeping the highest-scoring edge per cluster pair. (5) Length filter [10, 1024] on both endpoints; final shuffle (seed 42); write only (seq1, seq2).
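A sketch of the pair-construction step (4), assuming a links table and a protein-to-rep-50 mapping as pandas DataFrames (column names are illustrative, not those of the released script):

```python
import pandas as pd

def build_cluster_pairs(links: pd.DataFrame, rep50: pd.DataFrame) -> pd.DataFrame:
    """links: protein1, protein2, combined_score; rep50: protein, cluster (rep-50 representative)."""
    links = links[links["combined_score"] >= 400].copy()               # STRING medium confidence
    links = (links
             .merge(rep50.rename(columns={"protein": "protein1", "cluster": "c1"}), on="protein1")
             .merge(rep50.rename(columns={"protein": "protein2", "cluster": "c2"}), on="protein2"))
    links = links[links["c1"] != links["c2"]].copy()                    # drop self-cluster edges
    swap = links["c1"] > links["c2"]                                    # canonicalise unordered pairs
    links["a"] = links["c1"].where(~swap, links["c2"])
    links["b"] = links["c2"].where(~swap, links["c1"])
    links = (links.sort_values("combined_score", ascending=False)
                  .drop_duplicates(subset=["a", "b"]))                  # keep the best-scoring edge
    return links[["a", "b", "combined_score"]]
```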
Final. 36,502,692 pairs over 6.73M unique proteins per endpoint (input: STRING v12.0 has ~59M proteins and >20B physical links; the score ≥ 400 filter, Bernett decontamination, and cluster-pair deduplication account for the reduction).
### 12.5 ProteinGym DMS / clinical (dms_cosent.parquet)
Source. OATML-Markslab/ProteinGym_v1 on HuggingFace, splits DMS_substitutions, DMS_indels, clinical_substitutions, clinical_indels.
Pipeline. (1) Continuous DMS: per-assay z-score of DMS_score, clipped to [−3, 3] and rescaled to [0, 1]. (2) Clinical: Pathogenic → 0.0, Benign → 1.0. (3) Drop assays with DMS_id starting GB1_ or GFP_AEQVI_ (overlap with our protein-level benchmarks). (4) Test-fold drop: if a recognised split column (stage, split, set, fold) carries train/test labels, drop test rows; otherwise apply the supervised benchmark's deterministic per-group 80/20 split (RandomState(42), groups of DMS_id for DMS rows or protein_id for clinical rows). Groups with fewer than 10 rows are not split (the supervised benchmark also skips these at evaluation, so this introduces no leak). Pairs that appear in any test row are also removed globally as a final pass. (5) Final shuffle (seed 42). Optional mutant–mutant intra-assay pairing is disabled.
Final. 2,175,734 pairs over 3,576 unique wild-type targets and 2.09M unique mutants (input: ~3M raw ProteinGym v1 rows across the four splits; the GB1/GFP assay drop and the supervised-test-fold drop account for the reduction). CoSENT preserves the score-induced ordering of pairs rather than training to absolute targets.
### 12.6 Compute and reproducibility
End-to-end on a single 48-thread workstation: STRING two-stage clustering under 6 h, AFDB join ~30 min on NVMe, Pfam hard-negative generation ~2k anchors/s on 32 threads. All thresholds, identity cutoffs, and seeds are fixed in data_prep.py; re-running with default flags (plus --limit_gb 0 for AFDB) reproduces every parquet bit-for-bit against a fixed snapshot of the upstream HuggingFace, EBI, and STRING-DB releases.