Better Protein Function Prediction by Modeling Survivorship Bias

arXiv cs.LG Papers

Summary

This paper introduces Evo-PU, a positive-unlabeled learning framework that models survivorship bias in protein sequence data by leveraging evolutionary mutation processes. The authors demonstrate that Evo-PU outperforms standard PU methods and protein language models in predicting protein functionality for influenza, RSV, and SARS-CoV-2.

arXiv:2605.06879v1 Announce Type: new

# Better Protein Function Prediction by Modeling Survivorship Bias
Source: [https://arxiv.org/html/2605.06879](https://arxiv.org/html/2605.06879)

Zhongmou Chao¹ (zc83@cornell.edu), Poompol Buathong² (pb482@cornell.edu), Ekaterina Selivanovitch¹ (es838@cornell.edu), Susan Daniel¹ (sd386@cornell.edu), Peter I. Frazier³ (pf98@cornell.edu)

¹ Smith School of Chemical and Biomolecular Engineering, Cornell University, USA; ² Center for Applied Mathematics, Cornell University, USA; ³ School of Operations Research and Information Engineering, Cornell University, USA

###### Abstract

Protein sequence data from nature exhibits survivorship bias: we only observe data from those organisms that survive and reproduce, while non\-functional protein mutations are eliminated by natural selection\. Thus, predicting whether a protein sequence is functional often requires learning from positive examples alone\. While positive\-unlabeled \(PU\) learning frameworks offer a generic solution to this problem, existing PU methods ignore the evolutionary processes that shape sequence observability and cause survivorship bias\. Consider a sequence that is one mutation away from a commonly\-observed protein variant in a well\-surveilled organism\. If the sequence were functional, it would likely be observed\. If it is not observed, this suggests non\-functionality\. In contrast, sequences that are unlikely to arise through mutation may be missing simply because they never arose\. Thus, these two kinds of missing sequences should be treated differently when training models\. In this work, we propose Evo\-PU, a PU learning framework that uses a scientific understanding of nucleotide mutation to model survivorship bias for well\-surveilled single\-organism sequence data\. On three prediction tasks using single\-organism uniform\-coverage surveillance data—predicting results from held\-out influenza and respiratory syncytial virus \(RSV\) mutagenesis studies, and predicting future SARS\-CoV\-2 variants—Evo\-PU outperforms standard PU learning, one\-class classification \(OCC\), and protein language models \(PLMs\)\. On prediction tasks from multi\-organism ProteinGym datasets with more heterogeneous surveillance coverage, we identify opportunities to generalize our approach\.

## 1 Introduction

Understanding the relationship between protein sequence and function is central to environmental sustainability, human health, and materials science (Tournier et al., [2023](https://arxiv.org/html/2605.06879#bib.bib162); Tiller and Tessier, [2015](https://arxiv.org/html/2605.06879#bib.bib163); Wei et al., [2016](https://arxiv.org/html/2605.06879#bib.bib164)). However, existing protein datasets are not unbiased samples of sequence space: they are shaped by evolutionary and experimental selection processes that systematically overrepresent functional sequences while filtering out non-functional ones. This creates a survivorship bias (Bermúdez-Guzmán et al., [2020](https://arxiv.org/html/2605.06879#bib.bib156); Thomas et al., [2022](https://arxiv.org/html/2605.06879#bib.bib157)) that fundamentally limits standard supervised learning approaches.

This bias has two sources. First, natural evolution preferentially retains functional variants through selection, causing protein databases to be dominated by sequences that enhance organism survival. Second, experimental selection protocols such as biopanning systematically enrich for sequences with desired properties (e.g., binding affinity) while discarding those that fail to meet functional thresholds (Giordano et al., [2001](https://arxiv.org/html/2605.06879#bib.bib165); McGuire et al., [2009](https://arxiv.org/html/2605.06879#bib.bib166)). As a result, protein datasets contain predominantly positive examples and few verified negatives.

This structure makes protein function prediction a positive-unlabeled (PU) learning problem (Liu et al., [2003](https://arxiv.org/html/2605.06879#bib.bib73); Bekker and Davis, [2020](https://arxiv.org/html/2605.06879#bib.bib91)), where observed sequences are functional (positive) and all others are unlabeled: they may be functional but unobserved, or truly non-functional. Existing PU methods address this setting by introducing a class prior: the probability that an unlabeled sequence is truly positive. However, these methods typically assume a constant class prior across all sequences, ignoring the fact that a sequence's probability of being observed depends on its evolutionary accessibility. This assumption is biologically unrealistic and limits model performance: sequences that are one mutation away from highly prevalent variants are far more likely to have been observed (if functional) than sequences requiring multiple simultaneous mutations.

Alternative approaches include one-class classification (OCC) (Tax and Duin, [2001](https://arxiv.org/html/2605.06879#bib.bib66); Perera et al., [2021](https://arxiv.org/html/2605.06879#bib.bib50)) and protein language models (PLMs) trained on multiple sequence alignments (MSA) to capture evolutionary constraints and estimate functional likelihood (Meier et al., [2021](https://arxiv.org/html/2605.06879#bib.bib159); Frazer et al., [2021](https://arxiv.org/html/2605.06879#bib.bib59); Thadani et al., [2023](https://arxiv.org/html/2605.06879#bib.bib137)). While PLMs effectively predict overall protein fitness, they are less suited for capturing fine-grained functional signals from short, locally acting peptide motifs. We provide a detailed review of relevant literature in Appendix [A](https://arxiv.org/html/2605.06879#A1).

To address the limitations of PU learning, OCC, and PLM\-based approaches, we propose Evo\-PU, a PU learning framework that models survivorship bias through an evolution\-informed, sequence\-dependent class prior\. Our key insight is that functional nucleotide sequences closer to those prevalent in nature are more likely to be observed than functional sequences that are evolutionarily distant, since nearby sequences are more likely to arise through feasible mutational pathways\. By modeling protein evolution at the nucleotide level, Evo\-PU captures how mutational accessibility and prevalence jointly determine which functional sequences are observed\. This approach is specifically designed for single\-organism sequence data with uniform surveillance coverage, where the evolutionary process is well\-characterized and observation probabilities are relatively homogeneous\. This focus provides a foundation for future work extending to multi\-organism datasets with more heterogeneous surveillance coverage\.

We evaluate Evo-PU on multiple prediction tasks using single-organism surveillance data: predicting results from held-out influenza and respiratory syncytial virus (RSV) mutagenesis studies, and predicting future SARS-CoV-2 variants. Across all tasks, Evo-PU outperforms standard PU learning, one-class classification (OCC), and protein language model (PLM) methods. We further evaluate Evo-PU on multi-organism ProteinGym benchmarks (Notin et al., [2023](https://arxiv.org/html/2605.06879#bib.bib158)) to identify opportunities to generalize our approach beyond well-surveilled single-organism data.

## 2 Evo-PU method

We now formalize the data-generating process for protein sequence observation and derive the Evo-PU likelihood. Section [2.1](https://arxiv.org/html/2605.06879#S2.SS1) introduces the probabilistic model and Section [2.2](https://arxiv.org/html/2605.06879#S2.SS2) relates it to existing methods. Section [2.3](https://arxiv.org/html/2605.06879#S2.SS3) presents a nucleotide emergence model used to estimate the probability that a nucleotide sequence will arise through mutational pathways from ancestor sequences, which is a key component of the Evo-PU likelihood. This model builds on the nucleotide mutation model presented in Section [2.4](https://arxiv.org/html/2605.06879#S2.SS4). Finally, since computing the exact likelihood becomes infeasible as the amino acid sequence grows in length, we propose a fast and accurate approximation in Section [2.5](https://arxiv.org/html/2605.06879#S2.SS5).

### 2.1 Data-generating process and likelihood

We model the process by which functional sequences are observed through mutation, selection, and surveillance.

Let $\mathcal{A}$ denote the set of 20 natural amino acids. Let $\mathcal{X}$ be the set of amino acid sequences with a given length $L$. Let $A(x)\in\{0,1\}$ indicate whether amino acid sequence $x\in\mathcal{X}$ exhibits a functional property of interest. We aim to estimate the parameters $\theta$ in a probabilistic classifier $p_a(x;\theta)=P(A(x)=1)$. Our approach is agnostic to the classifier used and supports any continuously-differentiable neural architecture.

In biological systems, amino acid sequences are produced by translating nucleotide sequences through the genetic code. Each amino acid is encoded by a codon of three nucleotides. Let $\mathcal{N}$ denote the set of nucleotides (either $\{A,C,G,T\}$ for DNA or $\{A,C,G,U\}$ for RNA). Some codons are stop signals and do not encode amino acids; we define $\mathcal{Y}$ to be the set of valid nucleotide sequences of length $3L$ that translate to amino acid sequences in $\mathcal{X}$. Let $B:\mathcal{Y}\to\mathcal{X}$ denote the biological translation map. For a given amino acid sequence $x$, we define $\mathcal{Y}(x)=\{\,y\in\mathcal{Y}:B(y)=x\,\}$ to be the set of nucleotide sequences encoding $x$.
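As a concrete illustration of the translation map $B$ and the preimage $\mathcal{Y}(x)$, the following minimal Python sketch builds the standard genetic code over the DNA alphabet and enumerates every nucleotide sequence that encodes a short peptide. The names `translate` and `encodings` are illustrative and do not come from the paper; the sketch also assumes the standard genetic code, which the text does not state explicitly.

```python
from itertools import product

# Standard genetic code, bases ordered T, C, A, G; '*' marks a stop codon.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# Group the 61 sense codons by the amino acid they encode.
SYNONYMS = {}
for codon, aa in CODON_TABLE.items():
    if aa != "*":
        SYNONYMS.setdefault(aa, []).append(codon)


def translate(y):
    """Translation map B: nucleotide string of length 3L -> amino acid string."""
    return "".join(CODON_TABLE[y[i:i + 3]] for i in range(0, len(y), 3))


def encodings(x):
    """Y(x): every nucleotide sequence that translates to amino acid sequence x."""
    return ["".join(codons) for codons in product(*(SYNONYMS[aa] for aa in x))]


if __name__ == "__main__":
    for y in encodings("ML"):
        print(y, translate(y))
```

Because the size of $\mathcal{Y}(x)$ is the product of the per-residue codon counts, exhaustive enumeration of this kind is only practical for short motifs.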

New nucleotide sequences arise through mutation and may or may not persist depending on biological viability and selective pressures. We use the term *emergence* to denote the event that a nucleotide sequence is generated by mutation. Let $E(y)\in\{0,1\}$ indicate whether nucleotide sequence $y$ emerges. Let $\alpha$ be a vector of unknown nuisance parameters in a second probabilistic model $p_e(y;\alpha)=P(E(y)=1)$. Details on the functional form of $p_e$ are given in Section [2.3](https://arxiv.org/html/2605.06879#S2.SS3).

We let $O_{\mathcal{Y}}(y)\in\{0,1\}$ indicate whether nucleotide sequence $y$ is observed and let $\mathcal{D}_{\mathcal{Y}}$ denote the set of observed nucleotide sequences. We assume $y$ is observed if it emerges, encodes a functional protein, and is detected by a surveillance process that has some unknown probability $p_o$ of succeeding. Thus,

$$P\big(O_{\mathcal{Y}}(y)=1\mid E(y),A(B(y))\big)=p_o\,E(y)\,A(B(y)).$$

We let $O_{\mathcal{X}}(x)$ indicate whether amino acid sequence $x$ is observed and let $\mathcal{D}_{\mathcal{X}}$ denote the set of observed such sequences. We have the relation $O_{\mathcal{X}}(x)=\mathbb{I}\big(\exists\,y\in\mathcal{Y}(x)\text{ such that }O_{\mathcal{Y}}(y)=1\big)$. Combining functionality, emergence, and observability yields

$$P\big(O_{\mathcal{X}}(x)=1\big)=p_a(x;\theta)\left[1-\prod_{y\in\mathcal{Y}(x)}\big(1-p_o\,p_e(y;\alpha)\big)\right].$$

The second factor on the right defines a sequence-dependent class prior corresponding to the probability that a functional sequence $x$ is observed.
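The observation probability above can be evaluated directly once the emergence probabilities of the encodings of $x$ are known. The sketch below is a minimal illustration under that assumption; the function names `class_prior` and `p_observed` and the numeric inputs are hypothetical and are not taken from the paper.

```python
import math


def class_prior(p_e_values, p_o):
    """Sequence-dependent class prior: 1 - prod_y (1 - p_o * p_e(y)).

    p_e_values holds the emergence probabilities of the nucleotide
    sequences y that encode one amino acid sequence x.
    """
    return 1.0 - math.prod(1.0 - p_o * p_e for p_e in p_e_values)


def p_observed(p_a, p_e_values, p_o):
    """P(O_X(x) = 1) = p_a(x) * class_prior(x)."""
    return p_a * class_prior(p_e_values, p_o)


if __name__ == "__main__":
    # Hypothetical values: two encodings, each emerging with probability 0.5.
    print(p_observed(p_a=0.8, p_e_values=[0.5, 0.5], p_o=0.5))
```

A sequence with many accessible encodings receives a class prior close to 1, so its absence from the data counts strongly against functionality; a sequence whose encodings rarely emerge receives a prior near 0 and contributes little.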

Given observed amino acid sequences $\mathcal{D}_{\mathcal{X}}\subset\mathcal{X}$, we define the Evo-PU log-likelihood as

$$\sum_{x\in\mathcal{D}_{\mathcal{X}}}\log P\big(O_{\mathcal{X}}(x)=1\big)+\sum_{x'\in\mathcal{D}_{\mathcal{X}}'}\log\big(1-P(O_{\mathcal{X}}(x')=1)\big), \tag{1}$$

where $\mathcal{D}_{\mathcal{X}}'$ is the complement of $\mathcal{D}_{\mathcal{X}}$.

Computing this likelihood is computationally challenging for large sequence lengths $L$ because $\mathcal{D}_{\mathcal{X}}'$ grows exponentially with $L$ (assuming that $\mathcal{D}_{\mathcal{X}}$ stays bounded in size). Section [2.5](https://arxiv.org/html/2605.06879#S2.SS5) describes an approximation that restricts attention to a set of unobserved sequences with high emergence probability, generated via nucleotide-level mutation from observed data. This set includes the terms from the complement of $\mathcal{D}_{\mathcal{X}}$ that have the largest effect on Eq. ([1](https://arxiv.org/html/2605.06879#S2.E1)). The approximation keeps computation tractable while aiming to closely match the exact log-likelihood.
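To give a sense of scale for the exponential growth mentioned above, the short sketch below counts the unobserved amino acid sequences for a few motif lengths. The lengths 11 and 23 correspond to motifs described in Section 3.1.1, and 504 is the number of observed influenza fusion peptide variants reported there; applying that count to the other lengths is purely a placeholder assumption.

```python
def complement_size(L, n_observed, alphabet_size=20):
    """Number of length-L amino acid sequences that are not in the observed set."""
    return alphabet_size ** L - n_observed


if __name__ == "__main__":
    for L in (5, 11, 23):
        print(f"L={L}: {complement_size(L, 504):.3e} unobserved sequences")
```

Even for a 23-residue motif the second sum in Eq. (1) would range over far more terms than can be enumerated, which is why a restricted candidate set is needed.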

We train the classifier to estimate $\theta$ and the nuisance parameters $\alpha$, $p_o$ by maximizing this approximation of the log-likelihood in Eq. ([1](https://arxiv.org/html/2605.06879#S2.E1)), less a regularization penalty. We use a penalty proportional to $\|\theta\|^2$, though other penalties are likely to perform similarly.

### 2.2 Comparison of Evo-PU with PU-learning and PLM-based methods

In this section, we compare our Evo\-PU likelihood in Eq\. \([1](https://arxiv.org/html/2605.06879#S2.E1)\) to existing PU\-learning likelihood formulations and discuss the central distinctions between Evo\-PU and PLM\-based methods\.

Comparison to existing PU-learning likelihoods. Here we present two related likelihood formulations within the PU-learning framework:

- Classical binary classifier likelihood, which assumes unobserved sequences lack the functional property:
  $$\sum_{x\in\mathcal{D}_{\mathcal{X}}}\log p_a(x;\theta)+\sum_{x'\in\mathcal{D}_{\mathcal{X}}'}\log\big(1-p_a(x';\theta)\big);$$
- Protein-PU likelihood proposed by Song et al. ([2021](https://arxiv.org/html/2605.06879#bib.bib75)), which incorporates a fixed labeling efficiency parameter $q\in(0,1)$:
  $$\sum_{x\in\mathcal{D}_{\mathcal{X}}}\log q\,p_a(x;\theta)+\sum_{x'\in\mathcal{D}_{\mathcal{X}}'}\log\big(1-q\,p_a(x';\theta)\big).$$

All three likelihoods share a similar structure: a sum over observed sequences in $\mathcal{D}_{\mathcal{X}}$ and a second sum over sequences not in $\mathcal{D}_{\mathcal{X}}$.

The classical likelihood can be viewed as a special case of both Evo-PU and Protein-PU. Setting $p_o\,p_e(y;\alpha)=1$ for all $y\in\mathcal{Y}(x)$ in Eq. ([1](https://arxiv.org/html/2605.06879#S2.E1)) implies that every functional amino acid sequence is always observed, reducing Evo-PU to the classical likelihood. Likewise, setting $q=1$ in Protein-PU also recovers the classical form.
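The shared structure of the three likelihoods, and the two reductions just described, can be checked numerically. The sketch below writes all three as one generic function with a per-sequence factor multiplying $p_a$; the function names and the toy probabilities are illustrative assumptions, not part of the paper.

```python
import math


def pu_loglik(p_obs, p_unobs, c_obs, c_unobs):
    """sum log(c * p_a) over observed + sum log(1 - c * p_a) over unobserved."""
    ll = sum(math.log(c * p) for p, c in zip(p_obs, c_obs))
    ll += sum(math.log(1.0 - c * p) for p, c in zip(p_unobs, c_unobs))
    return ll


def classical(p_obs, p_unobs):
    """Factor fixed at 1: unobserved sequences are treated as negatives."""
    return pu_loglik(p_obs, p_unobs, [1.0] * len(p_obs), [1.0] * len(p_unobs))


def protein_pu(p_obs, p_unobs, q):
    """Constant labeling efficiency q shared by every sequence."""
    return pu_loglik(p_obs, p_unobs, [q] * len(p_obs), [q] * len(p_unobs))


def evo_pu(p_obs, p_unobs, prior_obs, prior_unobs):
    """Sequence-dependent class prior supplied per sequence."""
    return pu_loglik(p_obs, p_unobs, prior_obs, prior_unobs)


if __name__ == "__main__":
    p_obs, p_unobs = [0.9, 0.8], [0.3, 0.1]  # toy classifier outputs
    print(classical(p_obs, p_unobs))
    print(protein_pu(p_obs, p_unobs, q=1.0))
    print(evo_pu(p_obs, p_unobs, [1.0, 1.0], [1.0, 1.0]))
```

All three printed values coincide, since both special cases set the multiplying factor to 1 for every sequence.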

Comparing Evo-PU and Protein-PU directly, Protein-PU models labeling efficiency with a constant parameter $q$, representing the probability that a functional sequence is labeled. In contrast, Evo-PU models the class prior as sequence-dependent through the term $1-\prod_{y\in\mathcal{Y}(x)}\big(1-p_o\,p_e(y;\alpha)\big)$, which reflects how likely a sequence is to emerge through mutational processes and be observed. This sequence-dependent formulation captures variability in the observation process that cannot be explained by a fixed efficiency parameter, leading to better alignment with the natural data-generating process and improved predictive performance.

Distinctions between Evo-PU and PLM-based methods. We highlight several key distinctions between Evo-PU and PLM-based approaches. First, PLM-based methods primarily capture patterns associated with overall evolutionary fitness, whereas Evo-PU is designed to identify sequence features that govern a specific biochemical property essential for organismal survival. Second, Evo-PU bases its predictions on an explicit model of why certain sequences are observed or missing, rather than relying solely on the empirical distribution of sequences in biological databases. While PLMs infer statistical constraints from observed sequences, they do not model the evolutionary and surveillance processes that determine the presence or absence of variants. Moreover, many PLM-based methods make predictions by evaluating the likelihood of mutations relative to a single reference or wild-type sequence. In contrast, Evo-PU explicitly models protein evolution by considering mutational pathways from multiple previously observed sequences, thereby accounting for the fact that a given sequence may arise through diverse evolutionary trajectories.

### 2.3 Nucleotide emergence model

The probability $p_e(y;\alpha)$ captures how likely an unobserved nucleotide sequence $y$ is to emerge through mutation from an ancestor. To model this probability, we assume that all observed nucleotide sequences, $\mathcal{D}_{\mathcal{Y}}$, are available as ancestors. The sequence $y$ can emerge through multiple mutational pathways from any such ancestor, with more prevalent ancestors contributing more opportunities for mutation.

With these observations, an unobserved nucleotide sequence $y$ emerges with probability

$$p_e(y;\alpha)=1-\prod_{y'\in\mathcal{D}_{\mathcal{Y}}}\big(1-P(y'\rightarrow y)\,\alpha\big)^{c(y')}, \tag{2}$$

where $P(y'\rightarrow y)$ is the mutation probability from $y'$ to $y$ during a single generation (see Section [2.4](https://arxiv.org/html/2605.06879#S2.SS4)), $\alpha$ is a hyperparameter that will be estimated and allows the effective emergence rate to scale to account for phenomena such as a functional sequence's failure to reproduce, and $c(y')$ is the assumed number of opportunities that ancestor sequence $y'$ had to create progeny. This formulation reflects that higher prevalence and closer mutational distance both increase emergence probability. For sequences $y$ already observed, we set $p_e(y;\alpha)=1$.

Both $P(y'\rightarrow y)$ and $\alpha$ are typically small as the mutation rate is low, while the counts $c(y')$ are large. Using the classical approximation $(1+x)^{a}\approx e^{ax}$ for $|x|\ll 1$ and $|ax|\gg 1$, we obtain

$$p_e(y;\alpha)\;\approx\;1-\exp\Big(-\sum_{y'\in\mathcal{D}_{\mathcal{Y}}}P(y'\rightarrow y)\,\alpha\,c(y')\Big). \tag{3}$$

We estimate the quantity $c(y)$ of each $y\in\mathcal{D}_{\mathcal{Y}}$ as proportional to the empirical frequency of $y$ in the observed data. We scale by the estimated number of unique organisms $T$ that had progeny during the period over which the data was collected. Our approach effectively scales this by the additional proportionality constant $\alpha$, as explained above.
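The exact product form in Eq. (2) and its exponential approximation in Eq. (3) can be compared numerically. The sketch below is illustrative only: the function names are hypothetical, the count estimate takes $c(y)$ to be $T$ times the empirical frequency (one reading of the description above), and the numeric values are made up. Since $(1+x)^a=\exp(a\ln(1+x))$ and $\ln(1+x)=x-x^2/2+\dots$, the two forms agree closely whenever $a x^2$ is small.

```python
import math
from collections import Counter


def ancestor_counts(observations, T):
    """c(y) = T * empirical frequency of y among the observed nucleotide sequences."""
    freq = Counter(observations)
    n = len(observations)
    return {y: T * k / n for y, k in freq.items()}


def p_emerge_exact(mut_probs, counts, alpha):
    """Eq. (2): 1 - prod_{y'} (1 - P(y'->y) * alpha) ** c(y')."""
    return 1.0 - math.prod((1.0 - p * alpha) ** c for p, c in zip(mut_probs, counts))


def p_emerge_approx(mut_probs, counts, alpha):
    """Eq. (3): 1 - exp(-sum_{y'} P(y'->y) * alpha * c(y'))."""
    return 1.0 - math.exp(-sum(p * alpha * c for p, c in zip(mut_probs, counts)))


if __name__ == "__main__":
    # Hypothetical numbers: one ancestor, small mutation probability, large count.
    args = ([1e-6], [1e6], 0.1)
    print(p_emerge_exact(*args), p_emerge_approx(*args))
```

The approximate form is convenient in practice because it turns a long product into a single weighted sum over ancestors.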

### 2.4 Mutation probability model

The mutation probability $P(y'\rightarrow y)$ depends on the number of nucleotide differences between $y'$ and $y$ and the types of mutations involved. Consider the set of four RNA nucleotides: adenine (A), guanine (G), cytosine (C), and uracil (U). RNA nucleotide mutations occur via two mechanisms (Luo et al., [2016](https://arxiv.org/html/2605.06879#bib.bib128)): *transition* (purine to purine or pyrimidine to pyrimidine) and *transversion* (purine to pyrimidine or vice versa). Transitions are more common than transversions due to chemical similarities. We provide possible RNA mutations in Appendix [A1](https://arxiv.org/html/2605.06879#A2.T1). For sequences differing at a single nucleotide position, we model $P(y'\rightarrow y)$ as proportional to the appropriate transition or transversion rate. For sequences differing at multiple positions, the probability decreases exponentially with the number of differences, making such mutations negligible in our approximation.
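A minimal sketch of this single-step mutation model is given below. It classifies a substitution as a transition or transversion and returns the corresponding rate, treating every pair of sequences that does not differ at exactly one position as having probability zero (the negligible multi-site case). The rate values `rate_ts` and `rate_tv` are hypothetical placeholders; the paper's actual rates are not given in this section.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "U"}


def mutation_type(a, b):
    """Classify the substitution a -> b between two distinct RNA nucleotides."""
    if a == b:
        raise ValueError("a and b must differ")
    pair = {a, b}
    same_class = pair.issubset(PURINES) or pair.issubset(PYRIMIDINES)
    return "transition" if same_class else "transversion"


def single_step_probability(y_anc, y, rate_ts, rate_tv):
    """P(y' -> y): the matching rate if exactly one position differs, else 0.0."""
    if len(y_anc) != len(y):
        return 0.0
    diffs = [(a, b) for a, b in zip(y_anc, y) if a != b]
    if len(diffs) != 1:
        return 0.0
    return rate_ts if mutation_type(*diffs[0]) == "transition" else rate_tv


if __name__ == "__main__":
    # Hypothetical rates with transitions twice as likely as transversions.
    print(single_step_probability("AUG", "AUA", rate_ts=2e-6, rate_tv=1e-6))
    print(single_step_probability("AUG", "AUC", rate_ts=2e-6, rate_tv=1e-6))
```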

### 2.5 Efficient approximation of the likelihood

Exact computation of Eq. ([1](https://arxiv.org/html/2605.06879#S2.E1)) is computationally intractable for large $L$ because the full set of unobserved sequences is exponentially large in $L$.

Here we describe an approximate method that leverages the fact that mutations are rare, especially simultaneous mutations to multiple amino acids. As a result, most nucleotide sequences in $\mathcal{Y}(x)$ that have not been observed in the nucleotide dataset $\mathcal{D}_{\mathcal{Y}}$ have negligible emergence probability $p_e(y;\alpha)\approx 0$, and thus make only a minor contribution to the likelihood ([1](https://arxiv.org/html/2605.06879#S2.E1)) if included.

Our approach approximates the likelihood by considering a smaller subset of amino acid sequences $\hat{\mathcal{D}}_{\mathcal{X}}'\subset\mathcal{D}_{\mathcal{X}}'$, generated from the observed nucleotide sequence dataset $\mathcal{D}_{\mathcal{Y}}$ and containing only sequences that are likely to emerge naturally yet remain unobserved. Specifically, we construct $\hat{\mathcal{D}}_{\mathcal{X}}'$ by first generating a set of unobserved nucleotide sequences $\hat{\mathcal{D}}_{\mathcal{Y}}'$ made up of sequences that are one point mutation away from some observed nucleotide sequence in $\mathcal{D}_{\mathcal{Y}}$. In $\hat{\mathcal{D}}_{\mathcal{Y}}'$, we include only those unobserved nucleotide sequences whose emergence probability satisfies $p_e(y';\alpha)>\epsilon$ for some fixed $\epsilon,\alpha>0$. Then, we construct $\hat{\mathcal{D}}_{\mathcal{X}}'=\{B(y')\in\mathcal{X}:y'\in\hat{\mathcal{D}}_{\mathcal{Y}}'\text{ and }B(y')\notin\mathcal{D}_{\mathcal{X}}\}$.
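The nucleotide-level part of this construction can be sketched as follows. The code enumerates all single-substitution neighbors of the observed sequences, drops those already observed, and keeps the ones whose emergence probability exceeds a threshold. The function names are hypothetical, `p_e` is passed in as an arbitrary callable, and the final steps (translating through $B$, discarding sequences with stop codons, and removing amino acid sequences already in the observed set) are omitted for brevity.

```python
def point_neighbors(y, alphabet="ACGU"):
    """All sequences that differ from y by exactly one substitution."""
    return {
        y[:i] + b + y[i + 1:]
        for i in range(len(y))
        for b in alphabet
        if b != y[i]
    }


def candidate_unobserved(observed, p_e, eps, alphabet="ACGU"):
    """Unobserved one-mutation neighbors of observed sequences with p_e(y) > eps."""
    observed = set(observed)
    candidates = set()
    for y in observed:
        candidates |= point_neighbors(y, alphabet) - observed
    return {y for y in candidates if p_e(y) > eps}


if __name__ == "__main__":
    # Toy example with a constant, hypothetical emergence probability.
    kept = candidate_unobserved({"AUG", "AUA"}, p_e=lambda y: 0.01, eps=1e-3)
    print(len(kept), sorted(kept)[:3])
```

The candidate set grows linearly with the number of observed sequences and with sequence length, in contrast to the exponential size of the full complement.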

Moreover, to further reduce the computation in the class-prior term $\prod_{y\in\mathcal{Y}(x)}\big(1-p_o\,p_e(y;\alpha)\big)$, we restrict $\mathcal{Y}(x)$ for any $x\in\mathcal{D}_{\mathcal{X}}\cup\hat{\mathcal{D}}_{\mathcal{X}}'$ to $\hat{\mathcal{Y}}(x)=\mathcal{Y}(x)\cap\big(\mathcal{D}_{\mathcal{Y}}\cup\hat{\mathcal{D}}_{\mathcal{Y}}'\big)$.

By replacing $\mathcal{D}_{\mathcal{X}}'$ with the subset $\hat{\mathcal{D}}_{\mathcal{X}}'$ and $\mathcal{Y}(x)$ with $\hat{\mathcal{Y}}(x)$ in Eq. ([1](https://arxiv.org/html/2605.06879#S2.E1)), and using the emergence probabilities $p_e(y;\alpha)=1$ for all $y\in\mathcal{D}_{\mathcal{Y}}$ and $p_e(y';\alpha)$ as defined in Eq. ([3](https://arxiv.org/html/2605.06879#S2.E3)) for all $y'\in\hat{\mathcal{D}}_{\mathcal{Y}}'$, we approximate the log-likelihood function by:

$$
\begin{aligned}
\ell_n(\theta,p_o,\alpha;\mathcal{D}_{\mathcal{X}})\approx\hat{\ell}_n(\theta,p_o,\alpha;\mathcal{D}_{\mathcal{X}}):={}&\sum_{x\in\mathcal{D}_{\mathcal{X}}}\Big[\log p_a(x;\theta)+\log\Big(1-\prod_{y\in\hat{\mathcal{Y}}(x)}\big(1-p_o\,p_e(y;\alpha)\big)\Big)\Big]\\
&+\sum_{x'\in\hat{\mathcal{D}}_{\mathcal{X}}'}\log\Big[1-p_a(x';\theta)\Big(1-\prod_{y'\in\hat{\mathcal{Y}}(x')}\big(1-p_o\,p_e(y';\alpha)\big)\Big)\Big]
\end{aligned}
\tag{4}
$$

We then train the probabilistic classifier $p_a(x;\theta)$ by jointly estimating the classifier parameters $\theta$ and two nuisance parameters, the nucleotide observability efficiency $p_o$ and the probability $\alpha$ that an emerged sequence becomes dominant, by minimizing the loss function defined as the negative of this approximated log-likelihood:

$$(\theta^{*},p_o^{*},\alpha^{*})\in\operatorname*{arg\,min}_{(\theta,p_o,\alpha)\in\Theta\times(0,1)\times(0,1)}-\hat{\ell}_n(\theta,p_o,\alpha;\mathcal{D}_{\mathcal{X}}). \tag{5}$$
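The loss being minimized can be written compactly once the classifier outputs and class priors have been computed for each retained sequence. The sketch below is a simplified illustration, not the authors' implementation: the function names are hypothetical, the optional $\ell_2$ penalty follows the description in Section 2.1, and the sigmoid reparameterization used to keep $p_o$ and $\alpha$ inside $(0,1)$ is an assumption, since the text does not say how the constraint is enforced.

```python
import math


def sigmoid(z):
    """Map an unconstrained real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def neg_approx_loglik(pos, unl, theta=(), lam=0.0):
    """Negative approximate log-likelihood plus lam * ||theta||^2.

    pos: (p_a, prior) pairs for observed amino acid sequences.
    unl: (p_a, prior) pairs for the retained unobserved sequences.
    prior stands for 1 - prod_y (1 - p_o * p_e(y; alpha)) over the
    restricted encoding set of the sequence.
    """
    ll = sum(math.log(p_a) + math.log(prior) for p_a, prior in pos)
    ll += sum(math.log(1.0 - p_a * prior) for p_a, prior in unl)
    return -ll + lam * sum(t * t for t in theta)


if __name__ == "__main__":
    p_o = sigmoid(0.3)  # unconstrained value 0.3 mapped into (0, 1)
    print(p_o, neg_approx_loglik([(0.9, 0.7)], [(0.2, 0.4)], theta=(0.1, -0.2), lam=50.0))
```

With this reparameterization an unconstrained gradient-based optimizer could update the raw values freely while $p_o$ and $\alpha$ remain valid probabilities.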

## 3 Numerical experiments

We evaluate Evo-PU across two complementary data regimes: well-surveilled single-organism datasets and multi-organism protein benchmarks. The single-organism setting uses large-scale viral genomic surveillance data from influenza, RSV and SARS-CoV-2, where survivorship bias and mutational accessibility are especially informative. In these tasks, Evo-PU is tested on its ability to identify sequence features that control specific viral functions and to anticipate properties of newly emerging variants over evolutionary time. To explore broader applicability beyond this regime, we also consider the ProteinGym benchmarks (Notin et al., [2023](https://arxiv.org/html/2605.06879#bib.bib158)), which evaluate prediction of overall protein fitness across diverse protein families, providing a contrasting multi-organism setting.

### 3.1 Single-organism tasks

#### 3.1.1 Predicting functional motifs of various viral proteins

Problem background: Key steps of viral infection (immune evasion, host receptor binding, and membrane fusion) are mediated by short functional motifs within viral proteins. In influenza hemagglutinin, the Ca1 epitope (11 residues; positions 169–173, 206–208, 238–240 in H3 numbering) drives immune evasion, the receptor-binding domain (23 residues; positions 134–138, 186–195, 221–228 in H3 numbering) mediates host attachment, and the fusion peptide (23 residues; positions 1–23 of the HA2 subunit) enables membrane fusion. In respiratory syncytial virus (RSV), the heptad repeat C (HRC) domain in the fusion protein (23 residues; positions 75–97) provides the mechanical force for fusion. In SARS-CoV-2, the Spike fusion peptide (48 residues; positions 808–855) is highly conserved and regulates viral entry. We evaluate Evo-PU on predicting motif variants associated with these functions across viruses.

Dataset: From influenza hemagglutinin nucleotide sequences (years 2001–2024), we extracted 7,383 unique nucleotide sequences encoding 504 fusion peptide variants. For binding peptides, restricting to human-infecting strains yielded 3,862 nucleotide sequences encoding 1,458 variants. For the Ca1 epitope, we used H1 human sequences (2001–2024), obtaining 497 nucleotide sequences encoding 181 variants. For RSV, surveillance data collected before 2026 yielded 366 nucleotide sequences encoding 73 HRC variants. For SARS-CoV-2, 2.8 million sequences collected before Oct 2021 were processed to obtain 657 nucleotide sequences encoding 357 fusion peptide variants. In all cases, the nucleotide datasets form the observed set $\mathcal{D}_{\mathcal{Y}}$, and the translated amino acid datasets form $\mathcal{D}_{\mathcal{X}}$. These sets are used to compute the emergence probability from the model in Section [2.3](https://arxiv.org/html/2605.06879#S2.SS3), which enters the approximated log-likelihood of Section [2.5](https://arxiv.org/html/2605.06879#S2.SS5).

Test datasets differ by task. For influenza fusion peptides, 76 mutants from site-directed mutagenesis studies were used (46 functional, 30 impaired). For binding peptides, 33 lab-generated mutants plus 11 sequences newly observed in 2025 were used (23 binding, 21 non-binding). For evasion (Ca1), 51 sequences observed in 2025 were treated as positives, while 51 negatives were generated by introducing nine nucleotide mutations into observed sequences. For RSV, mutagenesis studies provided 25 test sequences (10 functional, 15 impaired). For SARS-CoV-2, 19 variants observed after Oct 2021 were used as positives, together with 19 randomly generated 10-mutation variants treated as negatives. Details on the source of training and test data are included in Appendix [C](https://arxiv.org/html/2605.06879#A3).
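Synthetic negatives of the kind described above (a fixed number of nucleotide mutations introduced into an observed sequence) could be generated along the following lines. This is only a sketch under assumptions: the text does not say how positions or replacement bases were chosen, or whether stop codons were avoided, so the uniform sampling used here and the function name `mutate_k_positions` are hypothetical.

```python
import random


def mutate_k_positions(y, k, rng, alphabet="ACGU"):
    """Return a copy of y with substitutions at exactly k distinct positions."""
    positions = rng.sample(range(len(y)), k)
    seq = list(y)
    for i in positions:
        # Draw a replacement base that differs from the current one.
        seq[i] = rng.choice([b for b in alphabet if b != seq[i]])
    return "".join(seq)


if __name__ == "__main__":
    rng = random.Random(0)      # fixed seed for reproducibility
    observed = "AUG" * 11       # toy 33-nucleotide sequence (11 codons)
    negative = mutate_k_positions(observed, 9, rng)
    print(negative, sum(a != b for a, b in zip(observed, negative)))
```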

#### 3.1.2 Predicting emerging SARS-CoV-2 fusion peptide variants

Problem background: The highly conserved nature of the SARS-CoV-2 fusion peptide makes it a stable therapeutic target. Yet the rare emergence of a mutation that compromises current therapeutics could have severe consequences. Hence, predicting not-yet-observed fusion peptide mutations enables proactive measures against future escape variants.

Dataset: We used the same set of 657 unique nucleotide sequences that encode 357 unique amino acid sequences observed by Oct 2021 as training data. These sequences form the nucleotide observation set $\mathcal{D}_{\mathcal{Y}}$ and the translated amino acid observation set $\mathcal{D}_{\mathcal{X}}$, used in Evo-PU in the same way as in the previous tasks. A total of 19 new fusion peptide variants (observed post-Oct 2021), together with their observed frequencies, were used as test data.

### 3.2 Multiple-organism tasks

ProteinGym (Notin et al., [2023](https://arxiv.org/html/2605.06879#bib.bib158)) is a benchmark for protein-fitness prediction containing standardized DMS assays and curated clinical datasets with annotated mutation effects. We evaluate Evo-PU on two ProteinGym tasks: (1) PSAE\_PICP2 (PSAE) and (2) A0A247D711\_LISMN (A0). Evo-PU is trained on the associated multiple sequence alignment (MSA) sequences, which form the amino acid observation set $\mathcal{D}_{\mathcal{X}}$ (1,785 sequences for PSAE; 57 for A0), and evaluated on the corresponding DMS substitution datasets (1,581 sequences for PSAE; 1,653 for A0). Our evolutionary model operates at the nucleotide level and requires prevalence information, which ProteinGym does not provide. When running Evo-PU, we therefore construct the nucleotide observation set $\mathcal{D}_{\mathcal{Y}}$ by randomly sampling a nucleotide sequence encoding each amino-acid MSA sequence and assuming equal prevalence across all sequences. Although Evo-PU is not designed for general protein-fitness prediction on multiple-organism protein data, the goal of these experiments is precisely to assess how well it performs in this broader setting.
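The construction of the nucleotide observation set for the ProteinGym tasks can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: it assumes gap-free sequences over the 20 standard amino acids, the standard genetic code, a uniform choice among synonymous codons, and equal prevalence expressed as a normalized weight. All function names are hypothetical.

```python
import random
from itertools import product

# Standard genetic code, codons enumerated in TCAG order ("*" marks a stop codon).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

# Invert the table: amino acid -> list of synonymous codons.
AA_TO_CODONS = {}
for codon, aa in CODON_TO_AA.items():
    AA_TO_CODONS.setdefault(aa, []).append(codon)


def translate(nucleotides):
    """Translate a coding sequence (length divisible by 3) into amino acids."""
    return "".join(CODON_TO_AA[nucleotides[i:i + 3]] for i in range(0, len(nucleotides), 3))


def sample_back_translation(protein, rng):
    """Pick one synonymous codon uniformly at random per residue (assumption)."""
    return "".join(rng.choice(AA_TO_CODONS[aa]) for aa in protein)


def build_nucleotide_observations(msa_proteins, seed=0):
    """Return {nucleotide sequence: prevalence} with equal prevalence for every entry."""
    rng = random.Random(seed)
    sampled = {sample_back_translation(p, rng) for p in msa_proteins}
    return {y: 1.0 / len(sampled) for y in sampled}
```

Calling `build_nucleotide_observations(["MKW", "ACD"])` returns two nucleotide sequences, each with weight 0.5, that translate back to the input proteins. Whether prevalence is stored as a count or as a normalized weight is a design choice made here for illustration only.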

### 3.3 Evo-PU: model choices

Evo-PU can be used with any probabilistic classifier. We consider logistic regression (LR) as a simple baseline and a neural network inspired by the Wide and Deep architecture (WD) (Cheng et al., [2016](https://arxiv.org/html/2605.06879#bib.bib152)) as a more expressive model. The neural network is described in Appendix [D](https://arxiv.org/html/2605.06879#A4).

As described in Section [2.1](https://arxiv.org/html/2605.06879#S2.SS1), we train the probabilistic classifier by optimizing the log-likelihood in Eq. (LABEL:eqn:approxlike) with $\ell_2$ regularization. The default penalty coefficient is set to $\lambda = 50$, and its effect on all single-organism tasks is evaluated in Appendix [E](https://arxiv.org/html/2605.06879#A5).

The parameter $T$ denotes the estimated number of infected hosts in the preceding period. For the influenza datasets, $\mathcal{D}_{\mathcal{Y}}$ spans 24 years (2001–2024); using the estimate of 1 billion influenza infections annually (Nair et al., [2011](https://arxiv.org/html/2605.06879#bib.bib56)), we set $T = 24$B (24 billion) for the fusion and binding tasks, and $T = 12$B for the evasion task since it only uses hemagglutinin subtype 1 (H1) data. For SARS-CoV-2, although the total number of infections is uncertain, we assume 1 billion global infections by Oct 2021. Because information for estimating $T$ in the RSV and ProteinGym tasks is limited, we use the default value $T = 24$B. Sensitivity to $T$ is further examined through an ablation study in Appendix [F](https://arxiv.org/html/2605.06879#A6).

To generate $\hat{\mathcal{D}}_{\mathcal{Y}}^{\prime}$, for each $y \in \mathcal{D}_{\mathcal{Y}}$ we generate all possible single-nucleotide mutants $y^{\prime}$, considering both transition and transversion mechanisms (Luo et al., [2016](https://arxiv.org/html/2605.06879#bib.bib128)) as discussed in Section [2.4](https://arxiv.org/html/2605.06879#S2.SS4). Following prior studies (Wakeley, [1996](https://arxiv.org/html/2605.06879#bib.bib79); Stoltzfus and Norris, [2016](https://arxiv.org/html/2605.06879#bib.bib78); Pauly et al., [2017](https://arxiv.org/html/2605.06879#bib.bib69); Acevedo et al., [2014](https://arxiv.org/html/2605.06879#bib.bib82)), we assume mutation probabilities of $P(y \to y^{\prime}) \approx 2.6 \times 10^{-5}$ for transitions and $1.4 \times 10^{-7}$ for transversions. These probabilities are applied to all problems considered. We construct the candidate set $\hat{\mathcal{D}}_{\mathcal{Y}}^{\prime}$ of likely emergent but unobserved nucleotide sequences by retaining sequences satisfying $p_e(y^{\prime}; \alpha) > \epsilon$, with $\epsilon = 1 - \exp(-10)$ and $\alpha = 1$. An ablation study on $\epsilon$ is reported in Appendix [G](https://arxiv.org/html/2605.06879#A7). For the influenza tasks, this yields 30,433, 17,366, and 3,067 unobserved nucleotide sequences for fusion, binding, and evasion peptides, respectively; RSV yields 6,714 sequences, and SARS-CoV-2 yields 82. Some generated nucleotide sequences translate into already observed amino acid sequences. After removing duplicates, the remaining sequences produce 1,916, 5,203, and 688 unique amino acid sequences for the three influenza tasks, 512 for RSV, 67 for SARS-CoV-2, and 167,308 and 26,153 for the PSAE and A0 ProteinGym tasks, respectively. These define the cardinality of $\hat{\mathcal{D}}_{\mathcal{X}}^{\prime}$ in Eq. (LABEL:eqn:approxlike). The generated nucleotide set $\hat{\mathcal{D}}_{\mathcal{Y}}^{\prime}$ is also used to approximate $\hat{\mathcal{Y}}(x)$, the set of nucleotide sequences translating to amino acid sequence $x$.
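The enumeration of single-nucleotide mutants described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes sequences over the DNA alphabet `ACGT` (substitute `U` for `T` for RNA), the function names are hypothetical, and the subsequent filtering by emergence probability $p_e$ is left out because it depends on the model defined in Section 2.3.

```python
# Per-mutation probabilities quoted in the text above.
P_TRANSITION = 2.6e-5
P_TRANSVERSION = 1.4e-7

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}  # assumption: DNA alphabet; use "U" instead of "T" for RNA


def is_transition(a, b):
    """A<->G and C<->T are transitions; every other substitution is a transversion."""
    return ({a, b} <= PURINES) or ({a, b} <= PYRIMIDINES)


def single_nucleotide_mutants(y):
    """Yield (mutant, probability) for every single-nucleotide substitution of y."""
    for i, base in enumerate(y):
        for new in "ACGT":
            if new == base:
                continue
            p = P_TRANSITION if is_transition(base, new) else P_TRANSVERSION
            yield y[:i] + new + y[i + 1:], p


mutants = dict(single_nucleotide_mutants("ATGGCA"))
print(len(mutants))  # 18 = 3 substitutions x 6 positions
```

A sequence of length $L$ has $3L$ single-nucleotide mutants, of which $L$ arise through transitions and $2L$ through transversions, so transitions dominate the probability mass under the rates above.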

Directly optimizing the loss in Eq. ([5](https://arxiv.org/html/2605.06879#S2.E5)) over discrete amino acid sequences is intractable. To make optimization feasible, we encode each amino acid using three physicochemical properties known to correlate with peptide function (Moon and Fleming, [2011](https://arxiv.org/html/2605.06879#bib.bib126); Foulquier, [2001](https://arxiv.org/html/2605.06879#bib.bib127)), and construct continuous sequence representations by concatenating these encodings across non-consecutive motifs (for the influenza binding and evasion tasks). For consistency, we apply the same representation to the ProteinGym benchmarks, though these properties are not tailored to those sequences and may limit performance accordingly. We also conduct an ablation study using ESM2-based protein representations (Lin et al., [2023](https://arxiv.org/html/2605.06879#bib.bib161)) in Appendix [H](https://arxiv.org/html/2605.06879#A8).

### 3.4 Comparison methods and metric

We compare Evo-PU against several baselines, including Protein-PU (Song et al., [2021](https://arxiv.org/html/2605.06879#bib.bib75)), the closest PU-learning framework for protein design, and the standard PU-learning method 2Step (Bekker and Davis, [2020](https://arxiv.org/html/2605.06879#bib.bib91)). To ensure a fair comparison, all PU methods are trained with unlabeled datasets matching the size of $\hat{\mathcal{D}}_{\mathcal{X}}^{\prime}$ used in Evo-PU: 1,916 (influenza fusion), 5,203 (influenza binding), 549 (influenza evasion), 512 (RSV HRC), 67 (SARS-CoV-2), 167,308 (PSAE), and 26,153 (A0) amino acid sequences. For baseline PU methods, unlabeled sequences are generated through uniform random sampling.

We further compare Evo-PU against two OCC methods: OC-SVM (Schölkopf et al., [2001](https://arxiv.org/html/2605.06879#bib.bib65)) and iForest (Liu et al., [2008](https://arxiv.org/html/2605.06879#bib.bib87)), which use only the observed set $\mathcal{D}_{\mathcal{X}}$. For consistency, all PU and OCC baselines use the same classifiers (LR or WD) and CHEM sequence representation as Evo-PU. We additionally benchmark against three PLM-based methods: EVE (Frazer et al., [2021](https://arxiv.org/html/2605.06879#bib.bib59)), zero-shot ESM-1v (Meier et al., [2021](https://arxiv.org/html/2605.06879#bib.bib159)), and kNN-ESM2 (Esmaili et al., [2025](https://arxiv.org/html/2605.06879#bib.bib160)). For kNN-ESM2, the same generated unlabeled sequences used in PU methods are treated as negatives.

Detailed descriptions of baselines and optimization settings are provided in Appendices [I](https://arxiv.org/html/2605.06879#A9) and [J](https://arxiv.org/html/2605.06879#A10). All models are evaluated on the same test datasets using area under the ROC curve (AUC) and average precision (AP). For the SARS-CoV-2-Emergence task (Section [3.1.2](https://arxiv.org/html/2605.06879#S3.SS1.SSS2)), we additionally compute the Spearman correlation coefficient ($\rho$) between predicted functional probability and observed variant frequency, where higher observed frequency is assumed to reflect greater evolutionary fitness.
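For readers who want to reproduce the two headline metrics without external dependencies, the following sketch computes AUC through the rank-sum (Mann-Whitney) identity and Spearman's $\rho$ as the Pearson correlation of average ranks. It is an illustrative implementation with hypothetical function names; it assumes binary labels coded as 0 and 1, and it omits average precision.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def roc_auc(labels, scores):
    """AUC via the rank-sum identity; labels are 0 (negative) or 1 (positive)."""
    ranks = average_ranks(scores)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    pos_rank_sum = sum(r for r, l in zip(ranks, labels) if l == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5
```

For example, `roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])` evaluates to 0.75, because three of the four positive-negative pairs are ranked correctly.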

### 3.5 Results and discussion

In this section, we report results for three influenza motif prediction tasks (fusion, binding, and evasion), one RSV motif prediction task (HRC), two SARS-CoV-2 motif prediction tasks (fusion peptide function and future emergence), and two ProteinGym benchmarks (PSAE and A0). We report AUC values for the protein motif functionality tasks, while for the emergence of SARS-CoV-2 fusion peptide variants we report the Spearman correlation $\rho$ with observed variant frequencies. We defer results on the average precision (AP) metric to Appendix [K](https://arxiv.org/html/2605.06879#A11).

![Refer to caption](https://arxiv.org/html/2605.06879v1/x1.png)

Figure 1: Performance comparison of Evo-PU on six single-organism tasks with baseline methods, including two PU-learning approaches (Protein-PU and 2Step), two one-class classifiers (iForest and OC-SVM), and three PLM-based methods (kNN-ESM2, ESM-1v, and EVE). The top panels correspond to influenza functional motif prediction tasks: (a) Fusion, (b) Binding, and (c) Evasion. The bottom panels present results for (d) RSV-HRC functional motif prediction, (e) SARS-CoV-2 fusion peptide motif prediction, and (f) SARS-CoV-2 emerging fusion peptide variant prediction. Panels (a)-(e) report AUC, while panel (f) reports the Spearman correlation coefficient $\rho$ between predicted scores and observed variant frequencies. Deterministic methods are shown as a single value, whereas stochastic methods are reported as the mean with standard error across runs. Circle markers denote the LR classifier, square markers denote the WD classifier, diamond markers denote method-specific classifiers, and red indicates the best-performing method in each panel.

![Refer to caption](https://arxiv.org/html/2605.06879v1/x2.png)

Figure 2: AUC performance comparison of Evo-PU on ProteinGym tasks: (a) A0 and (b) PSAE. Methods include PU-learning approaches (Protein-PU and 2Step), OCC methods (iForest and OC-SVM), and PLM-based methods (kNN-ESM2, ESM-1v, and EVE). Deterministic methods are shown as a single value, while stochastic methods are reported as the mean with standard error across runs. Circle markers denote the LR classifier, square markers denote the WD classifier, diamond markers denote method-specific classifiers, and red indicates the best-performing method in each panel.

Figure [1](https://arxiv.org/html/2605.06879#S3.F1) summarizes performance on all single-organism tasks across all methods. Evo-PU outperforms competing methods on influenza fusion and binding, as well as on RSV-HRC and the SARS-CoV-2 emerging variant prediction task, and remains competitive on influenza evasion.

In contrast, as shown in Figure [2](https://arxiv.org/html/2605.06879#S3.F2), PLM-based methods achieve the strongest performance on the multi-organism ProteinGym benchmarks. We attribute the reduced performance of Evo-PU in this regime to its design for well-surveilled single-organism data and to the use of a global observability parameter $p_o$, which does not capture organism-specific surveillance rates in heterogeneous datasets. Moreover, because ProteinGym does not provide prevalence information, we assume equal prevalence across sequences, which limits Evo-PU's ability to fully exploit its evolutionary modeling. Extending the framework to handle heterogeneous observability and to model evolutionary processes without explicit prevalence data is therefore a promising direction for broadening Evo-PU to multi-organism protein fitness prediction tasks.

## 4 Conclusion

We introduced Evo\-PU, an evolution\-informed positive–unlabeled framework for predicting protein functions that are critical to organism survival\. Evo\-PU explicitly models survivorship bias in protein sequence data by embedding nucleotide\-level mutation and natural selection into a sequence\-dependent observation model, yielding a biologically grounded likelihood for amino\-acid sequences\. Because exact likelihood computation is intractable over the full sequence space, we develop an efficient approximation that focuses on biologically plausible nucleotide\-derived variants\. We evaluate Evo\-PU on three influenza, one RSV and one SARS\-CoV\-2 motif prediction tasks, a prospective SARS\-CoV\-2 variant prediction task, and two ProteinGym benchmarks, where it outperforms existing methods in well\-surveilled viral settings and highlights both the strengths and limitations of extending survivorship\-aware modeling to protein fitness prediction across multiple organisms\.

Evo\-PU also leaves room for further development\. The current model does not account for insertions or deletions, and experimental validation of top\-ranked predictions would further strengthen its practical impact\. More broadly, extending Evo\-PU to settings with heterogeneous observability and to datasets that lack prevalence information would allow survivorship bias to be modeled in more general evolutionary contexts\. In addition, we anticipate that integrating transformer\-based architectures could further improve performance, representing an important direction for future work\.

This work has the potential for significant positive impacts across multiple domains\. In drug discovery, Evo\-PU’s ability to predict functional protein variants could accelerate the identification of therapeutic targets and the design of more effective biologics\. In biomaterials design, the method could guide the engineering of proteins with desired properties for industrial applications\. More broadly, by providing a principled approach to modeling survivorship bias in protein databases, this work advances scientific understanding of biochemistry and biology\. We also recognize potential negative impacts if the method is misused\. The ability to predict functional protein and peptide sequences could be exploited to design sequences that are dangerous to human health\. We emphasize that this work is intended for beneficial applications in medicine, materials science, and basic research, and that responsible use requires careful consideration of biosecurity implications and adherence to ethical guidelines\.

## References

- A. Acevedo, L. Brodsky, and R. Andino (2014) Mutational and fitness landscapes of an RNA virus revealed through population sequencing. Nature 505(7485), pp. 686–690.
- J. Bekker and J. Davis (2020) Learning from positive and unlabeled data: a survey. Machine Learning 109(4), pp. 719–760.
- L. Bermúdez-Guzmán, G. Jimenez-Huezo, A. Arguedas, and A. Leal (2020) Mutational survivorship bias: the case of PNKP. PLoS One 15(12), pp. e0237682.
- H. Cheng, L. Koc, J. Harmsen, T. Shaked, T. Chandra, H. Aradhye, G. Anderson, G. Corrado, W. Chai, M. Ispir, et al. (2016) Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, pp. 7–10.
- F. Esmaili, Y. Qin, D. Wang, and D. Xu (2025) Kinase-substrate prediction using an autoregressive model. Computational and Structural Biotechnology Journal 27, pp. 1103–1111.
- E. Foulquier (2001) External Links: [Link](https://www.imgt.org/IMGTeducation/Aide-memoire/_UK/aminoacids/charge/#:%CB%9C:text=3%20amino%20acids%20(arginine%2C%20lysine,atoms%20in%20their%20side%20chain)
- J. Frazer, P. Notin, M. Dias, A. Gomez, J. K. Min, K. Brock, Y. Gal, and D. S. Marks (2021) Disease variant prediction with deep generative models of evolutionary data. Nature 599(7883), pp. 91–95.
- R. J. Giordano, M. Cardó-Vila, J. Lahdenranta, R. Pasqualini, and W. Arap (2001) Biopanning and rapid analysis of selective interactive ligands. Nature Publishing Group US New York.
- Z. Lin, H. Akin, R. Rao, B. Hie, Z. Zhu, W. Lu, N. Smetanin, R. Verkuil, O. Kabeli, Y. Shmueli, et al. (2023) Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 379(6637), pp. 1123–1130.
- B. Liu, Y. Dai, X. Li, W. S. Lee, and P. S. Yu (2003) Building text classifiers using positive and unlabeled examples. In Third IEEE International Conference on Data Mining, pp. 179–186.
- F. T. Liu, K. M. Ting, and Z. Zhou (2008) Isolation forest. In 2008 Eighth IEEE International Conference on Data Mining, pp. 413–422.
- G. Luo, X. Li, Z. Han, Z. Zhang, Q. Yang, H. Guo, and J. Fang (2016) Transition and transversion mutations are biased towards GC in transposons of Chilo suppressalis (Lepidoptera: Pyralidae). Genes 7(10), pp. 72.
- M. J. McGuire, S. Li, and K. C. Brown (2009) Biopanning of phage displayed peptide libraries for the isolation of cell-specific ligands. Methods Mol Biol 504, pp. 291–321.
- J. Meier, R. Rao, R. Verkuil, J. Liu, T. Sercu, and A. Rives (2021) Language models enable zero-shot prediction of the effects of mutations on protein function. Advances in Neural Information Processing Systems 34, pp. 29287–29303.
- C. P. Moon and K. G. Fleming (2011) Side-chain hydrophobicity scale derived from transmembrane protein folding into lipid bilayers. Proceedings of the National Academy of Sciences 108(25), pp. 10174–10177.
- H. Nair, W. A. Brooks, M. Katz, A. Roca, J. A. Berkley, S. A. Madhi, J. M. Simmerman, A. Gordon, M. Sato, S. Howie, et al. (2011) Global burden of respiratory infections due to seasonal influenza in young children: a systematic review and meta-analysis. The Lancet 378(9807), pp. 1917–1930.
- P. Notin, A. Kollasch, D. Ritter, L. Van Niekerk, S. Paul, H. Spinner, N. Rollins, A. Shaw, R. Orenbuch, R. Weitzman, et al. (2023) ProteinGym: large-scale benchmarks for protein fitness prediction and design. Advances in Neural Information Processing Systems 36, pp. 64331–64379.
- M. D. Pauly, M. C. Procario, and A. S. Lauring (2017) A novel twelve class fluctuation test reveals higher than expected mutation rates for influenza A viruses. eLife 6, pp. e26437.
- P. Perera, P. Oza, and V. M. Patel (2021) One-class classification: a survey. arXiv preprint arXiv:2101.03064.
- B. Schölkopf, J. C. Platt, J. Shawe-Taylor, A. J. Smola, and R. C. Williamson (2001) Estimating the support of a high-dimensional distribution. Neural Computation 13(7), pp. 1443–1471.
- H. Song, B. J. Bremer, E. C. Hinds, G. Raskutti, and P. A. Romero (2021) Inferring protein sequence-function relationships with large-scale positive-unlabeled learning. Cell Systems 12(1), pp. 92–101.
- A. Stoltzfus and R. W. Norris (2016) On the causes of evolutionary transition: transversion bias. Molecular Biology and Evolution 33(3), pp. 595–602.
- D. M. Tax and R. P. Duin (2001) Uniform object generation for optimizing one-class classifiers. Journal of Machine Learning Research 2(Dec), pp. 155–173.
- N. N. Thadani, S. Gurev, P. Notin, N. Youssef, N. J. Rollins, D. Ritter, C. Sander, Y. Gal, and D. S. Marks (2023) Learning from prepandemic data to forecast viral escape. Nature 622(7984), pp. 818–825.
- A. Thomas, B. D. Evans, M. van der Giezen, and N. J. Harmer (2022) Survivor bias drives overestimation of stability in reconstructed ancestral proteins. bioRxiv, pp. 2022–11.
- K. E. Tiller and P. M. Tessier (2015) Advances in antibody design. Annual Review of Biomedical Engineering 17(1), pp. 191–216.
- V. Tournier, S. Duquesne, F. Guillamot, H. Cramail, D. Taton, A. Marty, and I. André (2023) Enzymes' power for plastics degradation. Chemical Reviews 123(9), pp. 5612–5701.
- J. Wakeley (1996) The excess of transitions among nucleotide substitutions: new methods of estimating transition bias underscore its significance. Trends in Ecology & Evolution 11(4), pp. 158–162.
- W. Wei, L. Petrone, Y. Tan, H. Cai, J. N. Israelachvili, A. Miserez, and J. H. Waite (2016) An underwater surface-drying peptide inspired by a mussel adhesive protein. Advanced Functional Materials 26(20), pp. 3496–3507.

Appendix

## Appendix A Literature review

In this section, we review existing methods relevant to our Evo\-PU framework\. We first discuss general approaches, including PU learning, one\-class classification \(OCC\), and protein language model \(PLM\)\-based methods, and then highlight specific studies that directly address protein applications, which are most relevant to our work\.

Positive-unlabeled learning: Typically, PU learning methods involve two primary steps: (1) identifying some unlabeled data as reliable negatives and (2) training a final classifier using the positive data and the reliable negatives. Examples of such methods include Spy-EM [Liu et al., [2002](https://arxiv.org/html/2605.06879#biba.bib20)] and Roc-SVM [Li et al., [2010](https://arxiv.org/html/2605.06879#biba.bib18)]. Alternatively, some PU learning methods treat unlabeled data as negative but assign greater importance to positive data by penalizing incorrect predictions of positive instances more heavily. Examples include biased-SVM [Liu et al., [2003](https://arxiv.org/html/2605.06879#biba.bib21)] and weighted logistic regression [Lee and Liu, [2003](https://arxiv.org/html/2605.06879#biba.bib17)]. A particularly relevant PU learning method for our study is the PU learning for protein design (Protein-PU) framework [Song et al., [2021](https://arxiv.org/html/2605.06879#biba.bib38)], which fits a logistic regression model using positive and unlabeled data through a custom loss function that incorporates prior knowledge about the distribution of labeled data.

One-class classification: OCC methods can be divided into One-class Support Vector Machine (OSVM)-based and non-OSVM-based approaches [Khan and Madden, [2014](https://arxiv.org/html/2605.06879#biba.bib15)]. Pioneering OSVM-based methods build the smallest hypersphere that encloses the positive samples (SVDD) [Tax and Duin, [1999a](https://arxiv.org/html/2605.06879#biba.bib41), [b](https://arxiv.org/html/2605.06879#biba.bib42), [2001](https://arxiv.org/html/2605.06879#biba.bib43)] or a hyperplane that separates the positive data from the origin (OC-SVM) [Schölkopf et al., [2001](https://arxiv.org/html/2605.06879#biba.bib35)]. Recent advances in OSVM-based methods use neural networks for feature extraction and apply traditional OSVM approaches to the extracted features [Erfani et al., [2016](https://arxiv.org/html/2605.06879#biba.bib8), Ghafoori and Leckie, [2020](https://arxiv.org/html/2605.06879#biba.bib11)]. Examples of non-OSVM-based methods include those using neural network models [Manevitz and Yousef, [2001](https://arxiv.org/html/2605.06879#biba.bib24), Skabar, [2003](https://arxiv.org/html/2605.06879#biba.bib37), Chalapathy, [2018](https://arxiv.org/html/2605.06879#biba.bib3)], decision trees [Liu et al., [2008](https://arxiv.org/html/2605.06879#biba.bib22), Désir et al., [2012](https://arxiv.org/html/2605.06879#biba.bib7), Xu et al., [2023](https://arxiv.org/html/2605.06879#biba.bib46)], nearest neighbors [Munroe and Madden, [2005](https://arxiv.org/html/2605.06879#biba.bib29)], and Bayesian classifiers [Wang and Stolfo, [2003](https://arxiv.org/html/2605.06879#biba.bib45)]. The OCC framework has also been tailored to protein-related applications. For example, Mei and Zhu [[2015](https://arxiv.org/html/2605.06879#biba.bib27)] considered the problem of predicting protein-protein interactions and proposed using an OSVM-based method to sample negative data first and then a two-class SVM as the final classifier. Yousef and Charkari [[2015](https://arxiv.org/html/2605.06879#biba.bib48)] proposed using SVDD together with physicochemical property-based representations of proteins to classify genes associated with diseases of interest.

Protein classification using protein language models: Recent methods for protein classification leverage deep generative models trained on large protein sequence databases or on multiple sequence alignments to capture amino acid distributions and evolutionary conservation. For example, zero-shot prediction via the protein language model ESM-1v [Meier et al., [2021](https://arxiv.org/html/2605.06879#biba.bib28)] computes the fitness likelihood of a queried sequence with respect to a wild-type sequence. The Evolutionary Model of Variant Effect (EVE) [Frazer et al., [2021](https://arxiv.org/html/2605.06879#biba.bib10)] predicts pathogenicity by training a variational autoencoder (VAE) on MSA-derived sequences of a human protein of interest. The VAE estimates the relative likelihood of each single amino acid variant compared to the wild type, producing evolutionary indices. These indices are then used to fit a two-component Gaussian mixture model that outputs pathogenicity probabilities. Another example is EVEscape [Thadani et al., [2023](https://arxiv.org/html/2605.06879#biba.bib44)], which extends the EVE framework by combining evolutionary scores from EVE with protein structural and chemical information and using logistic functions to predict the likelihood of immune escape in viral variants.

## Appendix B Mutation

In this section, we provide a list of possible RNA mutations through transition and transversion pathways as presented in Table[A1](https://arxiv.org/html/2605.06879#A2.T1)\.

Table A1: Possible scenarios of RNA nucleotide mutations

| Mutation type | RNA mutations |
| --- | --- |
| Transition | A → G, G → A, C → U, U → C |
| Transversion | A → C, A → U, G → C, G → U, C → G, C → A, U → A, U → G |

## Appendix C Dataset

We obtained prevalence data on host-infecting influenza hemagglutinin nucleotide sequences collected between 2001 and 2024 [[NCBI](https://arxiv.org/html/2605.06879#biba.bib30), Shu and McCauley, [2017](https://arxiv.org/html/2605.06879#biba.bib36)]. We extracted 7,383 unique nucleotide sequences that encode 504 unique amino acid sequences for fusion peptide mutants. In the binding peptide case, only human-infecting hemagglutinin nucleotide sequences were used, since different hemagglutinin subtypes can bind to non-human hosts through their affinity for other types of influenza receptors [Matrosovich et al., [2009](https://arxiv.org/html/2605.06879#biba.bib26)]. We identified 3,862 unique nucleotide sequences encoding 1,458 distinct binding peptide mutants. For the "evasion peptide" (Ca1 epitope) case, only human-infecting H1 hemagglutinin nucleotide sequences collected between 2001 and 2024 were used. We identified 497 unique nucleotide sequences encoding 181 distinct protein sequences located at the Ca1 antigenic site. From the human RSV surveillance data collected before 2026 [Shu and McCauley, [2017](https://arxiv.org/html/2605.06879#biba.bib36)], we identified 366 unique nucleotide sequences that encode 73 distinct HRC functional motif mutants. We obtained the sequencing data of human-infecting SARS-CoV-2 since 2019 from NCBI [NCBI, [2026](https://arxiv.org/html/2605.06879#biba.bib31)]. The 2.8 million sequences collected in the early stage of the outbreak (by Oct 2021) were used for training; they include 657 unique nucleotide sequences that encode 357 unique amino acid sequences. In our framework, we designate the nucleotide datasets as the observed nucleotide dataset $\mathcal{D}_{\mathcal{Y}}$, which is used to compute the emergence probability in the model proposed in Section [2.3](https://arxiv.org/html/2605.06879#S2.SS3) as part of the approximated log-likelihood function in Eq. (LABEL:eqn:approxlike). We designate the amino acid datasets translated from the observed nucleotide sequences as the observed amino acid sequence set $\mathcal{D}_{\mathcal{X}}$.

The held\-out test dataset for influenza fusion peptides comes from studies that examined the fusion properties of previously unseen fusion peptide mutants via site\-directed mutagenesis \[Han et al\.,[1999](https://arxiv.org/html/2605.06879#biba.bib12), Qiao et al\.,[1999](https://arxiv.org/html/2605.06879#biba.bib34), Tamm et al\.,[2002](https://arxiv.org/html/2605.06879#biba.bib40), Lai et al\.,[2006](https://arxiv.org/html/2605.06879#biba.bib16), Su et al\.,[2008](https://arxiv.org/html/2605.06879#biba.bib39), Cross et al\.,[2009](https://arxiv.org/html/2605.06879#biba.bib6)\]\. It contains 76 unique amino acid sequences, of which 46 exhibit the fusion property \(positive samples\) and 30 show impaired fusion \(negative samples\)\. Similarly, the test dataset for influenza binding peptides comprises 33 lab\-generated mutagenesis results \[Yang et al\.,[2007](https://arxiv.org/html/2605.06879#biba.bib47), Martín et al\.,[1998](https://arxiv.org/html/2605.06879#biba.bib25), Maines et al\.,[2011](https://arxiv.org/html/2605.06879#biba.bib23), Chen et al\.,[2012](https://arxiv.org/html/2605.06879#biba.bib4)\] and 11 newly observed functional binding peptides from 2025\. Among the 44 test sequences, 23 show binding affinity to human influenza receptors, while the remaining 21 show no binding\. For the influenza evasion task, the test set contains 51 peptide sequences collected in 2025 and labeled as evasive \(functional\)\. To form the non\-evasive class, we randomly sampled 51 observed nucleotide sequences and introduced nine nucleotide mutations to produce unobserved variants, which were then translated to amino acids\. At this mutation distance, these sequences are sufficiently dissimilar from the functional set that they are unlikely to help the virus evade the host immune response, so we treat them as negatives in the test set\.
For the RSV task, the test sequences are from site\-directed mutagenesis studies \[Bermingham et al\.,[2018](https://arxiv.org/html/2605.06879#biba.bib2), Hicks et al\.,[2018](https://arxiv.org/html/2605.06879#biba.bib13)\], with 15 mutants showing impaired fusion capability \(negatives\) and 10 mutants showing preserved fusion capability \(positives\)\. For the SARS\-CoV\-2 fusion task, the test sequences include 19 fusion peptide mutants observed after Oct 2021, together with 19 randomly generated mutants that are 10 point mutations away from the most prevalent fusion peptide sequence, which we treat as negatives in this task\. For the task of predicting the future emergence of SARS\-CoV\-2 fusion peptide variants, we use the same set of 657 unique nucleotide sequences, encoding 357 unique amino acid sequences observed by Oct 2021, as training data\. These sequences form the nucleotide observation set $\mathcal{D}_{\mathcal{Y}}$ and the translated amino\-acid observation set $\mathcal{D}_{\mathcal{X}}$, used in Evo\-PU in the same way as in the previous tasks\. A total of 19 new fusion peptide variants \(observed after Oct 2021\), together with their observed frequencies, were used as test data\.
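
The synthetic negatives described above (nine\-mutation variants for influenza evasion, 10\-point mutants for SARS\-CoV\-2) can be illustrated with a minimal sketch. It assumes that the mutated sites are distinct and that each substituted base is drawn uniformly from the three alternatives; the text does not state either detail. The function names and the toy parent sequence are hypothetical.

```python
import random

BASES = "ACGT"


def random_k_point_mutant(seq, k, rng):
    """Return a copy of `seq` with exactly k distinct positions substituted.

    Assumption: sites are sampled without replacement and each new base is
    chosen uniformly among the three bases that differ from the original.
    """
    if k > len(seq):
        raise ValueError("k cannot exceed the sequence length")
    positions = rng.sample(range(len(seq)), k)
    mutated = list(seq)
    for pos in positions:
        alternatives = [b for b in BASES if b != seq[pos]]
        mutated[pos] = rng.choice(alternatives)
    return "".join(mutated)


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


rng = random.Random(0)
parent = "ATGGCCATTGTAATGGGCCGC"  # toy 21-nt fragment, not data from the paper
negative = random_k_point_mutant(parent, 9, rng)
print(parent, negative, hamming(parent, negative))
```

In practice one would also discard any generated variant that coincides with an observed sequence, since the negatives are meant to be unobserved.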

## Appendix D A wide and deep neural network architecture

We customized a neural network structure, inspired by \[Cheng et al\.,[2016](https://arxiv.org/html/2605.06879#biba.bib5)\], that integrates linear memorization with nonlinear generalization for protein function classification\. The model processes an input feature vector through two parallel branches: a wide component, consisting of a single fully connected layer that projects the input into a 64\-dimensional space, and a deep component, implemented as a two\-layer perceptron with 32 and 16 hidden units, each followed by batch normalization, ReLU activation, and dropout \($p=0.3$\)\. The outputs of the wide and deep branches are concatenated into an 80\-dimensional joint feature representation, which is then mapped to a single sigmoid output neuron for binary classification\. Weights are initialized with Kaiming\-normal initialization\.
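
A minimal PyTorch sketch of the architecture described above follows. The class name, the input dimension (40 here), and the zero bias initialization are assumptions made for illustration; only the layer widths, dropout rate, and Kaiming\-normal weight initialization come from the text.

```python
import torch
import torch.nn as nn


class WideDeepClassifier(nn.Module):
    """Wide (64-d linear) and deep (32 -> 16 MLP) branches, concatenated to 80-d."""

    def __init__(self, in_dim, p_drop=0.3):
        super().__init__()
        # Wide branch: a single fully connected projection.
        self.wide = nn.Linear(in_dim, 64)
        # Deep branch: two hidden layers, each with BatchNorm, ReLU, Dropout.
        self.deep = nn.Sequential(
            nn.Linear(in_dim, 32), nn.BatchNorm1d(32), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(32, 16), nn.BatchNorm1d(16), nn.ReLU(), nn.Dropout(p_drop),
        )
        # Joint 80-d representation mapped to one output unit.
        self.head = nn.Linear(64 + 16, 1)
        for module in self.modules():
            if isinstance(module, nn.Linear):
                nn.init.kaiming_normal_(module.weight, nonlinearity="relu")
                nn.init.zeros_(module.bias)  # assumption: bias init is not stated

    def forward(self, x):
        joint = torch.cat([self.wide(x), self.deep(x)], dim=1)
        return torch.sigmoid(self.head(joint)).squeeze(1)


model = WideDeepClassifier(in_dim=40)  # 40 is a placeholder feature dimension
model.eval()  # disables dropout and uses BatchNorm running statistics
with torch.no_grad():
    probs = model(torch.randn(8, 40))
print(probs.shape)
```

Calling `model.eval()` matters in this toy usage because batch normalization in training mode needs more than one example per batch and dropout would make the outputs stochastic.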

## Appendix E Ablation study of the regularization coefficient $\lambda$

In this section, we investigate the effect of the regularization coefficient $\lambda$ used during training on the performance of the Evo\-PU framework\. We consider all single\-organism tasks, including Influenza\-Fusion, Influenza\-Binding, Influenza\-Evasion, RSV\-HRC, SARS\-CoV\-2\-Fusion, and SARS\-CoV\-2\-Emergence\.

The default value used throughout the main paper is $\lambda=50$\. To evaluate the sensitivity of the method to this hyperparameter, we additionally consider $\lambda=10$ and $\lambda=100$\. The resulting AUC and Spearman\-$\rho$ performances are shown in Figure [A1](https://arxiv.org/html/2605.06879#A5.F1), while the AP performances are shown in Figure [A2](https://arxiv.org/html/2605.06879#A5.F2)\. In each subplot, circle markers denote the LR classifier and square markers denote the WD classifier within the Evo\-PU framework\. The blue markers correspond to the setting reported in the main paper \($\lambda=50$\)\.

Overall, the performance of Evo\-PU is relatively stable across different values of $\lambda$, suggesting that the method is not highly sensitive to the choice of regularization coefficient\. The main exception is the Influenza\-Binding task, where using a smaller regularization coefficient \($\lambda=10$\) leads to a noticeable performance degradation, particularly for the WD classifier\. This behavior suggests that weaker regularization may allow the model to overfit the training distribution in this task\.

![Refer to caption](https://arxiv.org/html/2605.06879v1/x3.png)

Figure A1: Performance metrics of Evo\-PU under different regularization coefficients $\lambda$ on the following tasks: \(a\) Influenza\-Fusion, \(b\) Influenza\-Binding, \(c\) Influenza\-Evasion, \(d\) RSV\-HRC, \(e\) SARS\-CoV\-2\-Fusion, and \(f\) SARS\-CoV\-2\-Emergence\. Circle markers denote the LR classifier, while square markers denote the WD classifier within the Evo\-PU framework\. The x\-axis shows the regularization coefficient used during training\. For all tasks except SARS\-CoV\-2\-Emergence, the y\-axis reports the AUC metric\. For SARS\-CoV\-2\-Emergence, the y\-axis reports the Spearman\-$\rho$ correlation between predicted scores and observed emergence frequencies\. Blue markers indicate the default setting \($\lambda=50$\) used in the main paper\. Overall, Evo\-PU demonstrates relatively stable performance across different values of $\lambda$, with the largest sensitivity observed in the Influenza\-Binding task when using weaker regularization \($\lambda=10$\)\.

![Refer to caption](https://arxiv.org/html/2605.06879v1/x4.png)

Figure A2: Average precision \(AP\) performance of Evo\-PU under different regularization coefficients $\lambda$ on the following tasks: \(a\) Influenza\-Fusion, \(b\) Influenza\-Binding, \(c\) Influenza\-Evasion, \(d\) RSV\-HRC, and \(e\) SARS\-CoV\-2\-Fusion\. Circle markers denote the LR classifier, while square markers denote the WD classifier within the Evo\-PU framework\. The x\-axis shows the regularization coefficient used during training, while the y\-axis reports the AP metric\. Blue markers indicate the default setting \($\lambda=50$\) used in the main paper\. Overall, Evo\-PU demonstrates relatively stable performance across different values of $\lambda$, with only minor performance variations across most tasks\.

## Appendix F Ablation study of total infection cases $T$

In this section, we investigate the effect of the total infection case parameter $T$, which is used to approximate the term $c(y')$ in the emergence probability model of the Evo\-PU framework presented in Eq\. \([3](https://arxiv.org/html/2605.06879#S2.E3)\) in Section [2\.3](https://arxiv.org/html/2605.06879#S2.SS3)\. We consider all single\-organism tasks, including Influenza\-Fusion, Influenza\-Binding, Influenza\-Evasion, RSV\-HRC, SARS\-CoV\-2\-Fusion, and SARS\-CoV\-2\-Emergence\.

The default values of $T$ used in the main paper are 24B for Influenza\-Fusion, Influenza\-Binding, and RSV\-HRC, 12B for Influenza\-Evasion, and 1B for the two SARS\-CoV\-2 tasks\. To evaluate the sensitivity of Evo\-PU to this parameter, we additionally consider three alternative values for each task, corresponding to values approximately 10 times lower, 2 times lower, and 2 times higher than the default setting\. Specifically, we consider $T \in \{2.4\text{B}, 10\text{B}, 24\text{B}, 50\text{B}\}$ for Influenza\-Fusion, Influenza\-Binding, and RSV\-HRC, $T \in \{1.2\text{B}, 5\text{B}, 12\text{B}, 25\text{B}\}$ for Influenza\-Evasion, and $T \in \{0.1\text{B}, 0.5\text{B}, 1\text{B}, 2\text{B}\}$ for the two SARS\-CoV\-2 tasks\.

Since different values of $T$ lead to different emergence probabilities, the resulting generated sequence sets also vary across configurations\. Therefore, in Table [A2](https://arxiv.org/html/2605.06879#A6.T2), we report the number of generated nucleotide sequences \(\#nuc\) that satisfy the selection condition $p_e(y;\alpha) > \epsilon$, using the default values $\alpha=1$ and $\epsilon=1-\exp(-10)$ from the main paper, together with the number of unique translated amino acid sequences \(\#amino\)\.

The AUC and Spearman\-$\rho$ performances are reported in Figure [A3](https://arxiv.org/html/2605.06879#A6.F3), while the AP performances are reported in Figure [A4](https://arxiv.org/html/2605.06879#A6.F4)\. In each subplot, circle markers denote the LR classifier and square markers denote the WD classifier within the Evo\-PU framework\. Blue markers indicate the default configuration reported in the main paper\.

Overall, Evo\-PU demonstrates relatively stable performance across a wide range of $T$ values\. The main exception is the Influenza\-Binding task, where performance noticeably decreases when using the largest infection estimate \($T=50\text{B}$\), suggesting that excessively large generated sequence sets may introduce noisier unlabeled examples for this task\.

Table A2: Number of generated nucleotide sequences \(\#nuc\) and their corresponding unique translated amino acid sequences \(\#amino\) under different estimated total infection cases $T$ across all tasks including \(a\) Influenza\-Fusion, \(b\) Influenza\-Binding, \(c\) Influenza\-Evasion, \(d\) RSV\-HRC, \(e\) SARS\-CoV\-2\-Fusion, and \(f\) SARS\-CoV\-2\-Emergence\. For each task, the table reports the number of generated nucleotide sequences that satisfy the selection criterion $p_e(y;\alpha) > \epsilon$ using the default parameters $\alpha=1$ and $\epsilon=1-\exp(-10)$ from the main paper, together with the number of resulting unique translated amino acid sequences\. Larger values of $T$ generally lead to larger generated sequence sets\.

| Task | Total infection cases \($T$\) | \#nuc | \#amino |
|---|---|---|---|
| \(a\) Influenza\-Fusion | 2\.4B | 55,455 | 2,450 |
| | 10B | 13,752 | 1,118 |
| | 24B | 30,433 | 1,916 |
| | 50B | 3,429 | 558 |
| \(b\) Influenza\-Binding | 2\.4B | 2,396 | 1,264 |
| | 10B | 8,641 | 2,893 |
| | 24B | 17,366 | 5,203 |
| | 50B | 29,709 | 8,376 |
| \(c\) Influenza\-Evasion | 1\.2B | 683 | 173 |
| | 5B | 1,723 | 524 |
| | 12B | 3,067 | 688 |
| | 25B | 4,408 | 1,004 |
| \(d\) RSV\-HRC | 2\.4B | 1,650 | 85 |
| | 10B | 3,883 | 211 |
| | 24B | 6,714 | 512 |
| | 50B | 10,376 | 733 |
| \(e\) SARS\-CoV\-2\-Fusion | 0\.1B | 11 | 10 |
| | 0\.5B | 11 | 10 |
| | 1B | 82 | 67 |
| | 2B | 86 | 67 |
| \(f\) SARS\-CoV\-2\-Emergence | 0\.1B | 11 | 10 |
| | 0\.5B | 11 | 10 |
| | 1B | 82 | 67 |
| | 2B | 86 | 67 |

![Refer to caption](https://arxiv.org/html/2605.06879v1/x5.png)

Figure A3: Performance metrics of Evo\-PU under different total infection case estimates $T$ for all single\-organism tasks including \(a\) Influenza\-Fusion, \(b\) Influenza\-Binding, \(c\) Influenza\-Evasion, \(d\) RSV\-HRC, \(e\) SARS\-CoV\-2\-Fusion, and \(f\) SARS\-CoV\-2\-Emergence\. Circle markers denote the LR classifier, while square markers denote the WD classifier within the Evo\-PU framework\. The x\-axis shows the estimated total infection cases used in the emergence probability approximation\. For all tasks except SARS\-CoV\-2\-Emergence, the y\-axis reports the AUC metric\. For SARS\-CoV\-2\-Emergence, the y\-axis reports the Spearman\-$\rho$ correlation between predicted scores and observed emergence frequencies\. Blue markers indicate the default configuration reported in the main paper\. Overall, Evo\-PU demonstrates relatively stable performance across different values of $T$, with the largest sensitivity observed in the Influenza\-Binding task when using $T=50\text{B}$\.

![Refer to caption](https://arxiv.org/html/2605.06879v1/x6.png)

Figure A4: Average precision \(AP\) performance of Evo\-PU under different total infection case estimates $T$ for all single\-organism tasks including \(a\) Influenza\-Fusion, \(b\) Influenza\-Binding, \(c\) Influenza\-Evasion, \(d\) RSV\-HRC, and \(e\) SARS\-CoV\-2\-Fusion\. Circle markers denote the LR classifier, while square markers denote the WD classifier within the Evo\-PU framework\. The x\-axis shows the estimated total infection cases used in the emergence probability approximation, while the y\-axis reports the AP metric\. Blue markers indicate the default configuration reported in the main paper\. The results show that Evo\-PU remains generally robust to the choice of $T$ across most tasks\.

## Appendix G Ablation study of the threshold for unobserved nucleotides $\epsilon$

In the construction of the nucleotide approximation set $\hat{\mathcal{D}}_{\mathcal{Y}}'$ described in Section [2\.5](https://arxiv.org/html/2605.06879#S2.SS5), we generate nucleotide sequences that are one point mutation away from the observed nucleotide sequences and retain only those with emergence probability satisfying $p_e(y;\alpha) > \epsilon$ for fixed $\epsilon, \alpha > 0$\. In the experiments reported in the main paper, we use $\alpha=1$ and $\epsilon=1-\exp(-10)$ so that only nucleotide sequences with high emergence probabilities are included in the approximation set\.
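
The construction described above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the authors' implementation: `emergence_prob` is a hypothetical callable standing in for $p_e(y;\alpha)$ (its form is defined in the main paper and is not reproduced here), observed sequences are assumed to be excluded from the candidates, sequences are assumed to be uppercase DNA whose length is a multiple of three, and translation uses the standard genetic code.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order ('*' marks stop codons).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(nuc):
    """Translate a DNA string codon by codon, ignoring any trailing partial codon."""
    usable = len(nuc) - len(nuc) % 3
    return "".join(CODON_TABLE[nuc[i:i + 3]] for i in range(0, usable, 3))


def one_point_neighbors(seq):
    """Yield every sequence that differs from `seq` at exactly one position."""
    for i, base in enumerate(seq):
        for alt in BASES:
            if alt != base:
                yield seq[:i] + alt + seq[i + 1:]


def approximation_set(observed, emergence_prob, eps):
    """Return (kept nucleotide set, translated amino-acid set).

    `emergence_prob` is a hypothetical stand-in for p_e(y; alpha).
    """
    observed = set(observed)
    candidates = {m for s in observed for m in one_point_neighbors(s)} - observed
    kept = {y for y in candidates if emergence_prob(y) > eps}
    return kept, {translate(y) for y in kept}


observed = ["ATGGCC"]  # toy observation set, not data from the paper
kept_nuc, kept_amino = approximation_set(observed, lambda y: 1.0, eps=0.5)
print(len(kept_nuc), sorted(kept_amino))
```

Because several codons encode the same amino acid, the translated set is usually smaller than the nucleotide set, which is consistent with the gap between \#nuc and \#amino reported in the tables.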

To investigate the effect of the nucleotide selection threshold, we conduct an ablation study on the parameter $\epsilon$\. We consider all single\-organism tasks, including Influenza\-Fusion, Influenza\-Binding, Influenza\-Evasion, RSV\-HRC, SARS\-CoV\-2\-Fusion, and SARS\-CoV\-2\-Emergence\. In addition to the default value $\epsilon=1-\exp(-10)$ used in the main paper, we consider two alternative thresholds, namely $\epsilon=1-\exp(-5)$ and $\epsilon=1-\exp(-100)$\.
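
A practical note on these thresholds, offered as a sketch rather than a description of how the experiments were run: in double precision, $1-\exp(-100)$ is indistinguishable from 1, so a direct comparison `p > eps` can never succeed for that setting. An equivalent test works on the complement instead, since $p > 1-\exp(-k)$ holds exactly when $\log(1-p) < -k$. The helper name and its `log_complement` argument are hypothetical.

```python
import math

# Direct evaluation of the three thresholds in double precision.
for k in (5, 10, 100):
    eps = 1.0 - math.exp(-k)
    # For k = 100, exp(-k) is far below machine epsilon, so eps rounds to 1.0.
    print(k, eps, eps == 1.0)


def exceeds_threshold(log_complement, k):
    """Return True when p > 1 - exp(-k), given log_complement = log(1 - p).

    Working with log(1 - p) avoids the rounding of 1 - exp(-k) to 1.0.
    """
    return log_complement < -k


# A probability whose complement is exp(-120) passes the k = 100 threshold,
# while one whose complement is exp(-50) does not.
print(exceeds_threshold(-120.0, 100), exceeds_threshold(-50.0, 100))
```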

Although these values of $\epsilon$ are numerically close to 1, they lead to substantially different sizes of the nucleotide approximation set $\hat{\mathcal{D}}_{\mathcal{Y}}'$ and, consequently, different numbers of translated amino acid sequences in $\hat{\mathcal{D}}_{\mathcal{X}}'$\. The numbers of generated nucleotide and amino acid sequences for each value of $\epsilon$ are reported in Table [A3](https://arxiv.org/html/2605.06879#A7.T3)\.

The AUC and Spearman\-$\rho$ performances are reported in Figure [A5](https://arxiv.org/html/2605.06879#A7.F5), while the AP performances are reported in Figure [A6](https://arxiv.org/html/2605.06879#A7.F6)\. For all tasks except SARS\-CoV\-2\-Emergence, the reported metric is AUC or AP, while for SARS\-CoV\-2\-Emergence we report Spearman\-$\rho$\. In each subplot, circle markers denote the LR classifier and square markers denote the WD classifier within the Evo\-PU framework\. Blue markers indicate the configuration reported in the main paper\.

Overall, Evo\-PU demonstrates stable performance across all considered values of $\epsilon$ for all tasks\. These results suggest that the framework is relatively robust to the choice of nucleotide selection threshold, and that the default value $\epsilon=1-\exp(-10)$ used in the main paper provides a reasonable balance between restricting the approximation to highly probable nucleotide sequences and maintaining predictive performance\.

Table A3: Number of generated nucleotide sequences \(\#nuc\) and their corresponding unique translated amino acid sequences \(\#amino\) under different nucleotide selection thresholds $\epsilon$ across all tasks including \(a\) Influenza\-Fusion, \(b\) Influenza\-Binding, \(c\) Influenza\-Evasion, \(d\) RSV\-HRC, \(e\) SARS\-CoV\-2\-Fusion, and \(f\) SARS\-CoV\-2\-Emergence\. For each task, the table reports the number of nucleotide sequences satisfying the selection condition $p_e(y;\alpha) > \epsilon$ using the default parameter $\alpha=1$ from the main paper, together with the number of resulting unique translated amino acid sequences\. Smaller values of $\epsilon$ lead to larger approximation sets by allowing nucleotide sequences with lower emergence probabilities to be included\.

![Refer to caption](https://arxiv.org/html/2605.06879v1/x7.png)

Figure A5: Performance metrics of Evo\-PU under different nucleotide selection thresholds $\epsilon$ for the following problems: \(a\) Influenza\-Fusion, \(b\) Influenza\-Binding, \(c\) Influenza\-Evasion, \(d\) RSV\-HRC, \(e\) SARS\-CoV\-2\-Fusion, and \(f\) SARS\-CoV\-2\-Emergence\. Circle markers denote the LR classifier, while square markers denote the WD classifier within the Evo\-PU framework\. The x\-axis shows the nucleotide selection threshold $\epsilon$ used when constructing the approximation set $\hat{\mathcal{D}}_{\mathcal{Y}}'$\. For all tasks except SARS\-CoV\-2\-Emergence, the y\-axis reports the AUC metric\. For SARS\-CoV\-2\-Emergence, the y\-axis reports the Spearman\-$\rho$ correlation between predicted scores and observed emergence frequencies\. Blue markers indicate the default configuration reported in the main paper\. Overall, Evo\-PU demonstrates stable performance across different values of $\epsilon$\.

![Refer to caption](https://arxiv.org/html/2605.06879v1/x8.png)

Figure A6: Average precision \(AP\) performance of Evo\-PU under different nucleotide selection thresholds $\epsilon$ for the following problems: \(a\) Influenza\-Fusion, \(b\) Influenza\-Binding, \(c\) Influenza\-Evasion, \(d\) RSV\-HRC, and \(e\) SARS\-CoV\-2\-Fusion\. Circle markers denote the LR classifier, while square markers denote the WD classifier within the Evo\-PU framework\. The x\-axis shows the nucleotide selection threshold used when constructing the approximation set $\hat{\mathcal{D}}_{\mathcal{Y}}'$, while the y\-axis reports the AP metric\. Blue markers indicate the default configuration reported in the main paper\. The results show that Evo\-PU remains generally robust to the choice of $\epsilon$ across all considered tasks\.

## Appendix H Ablation study of protein representation

As discussed in Section [3\.3](https://arxiv.org/html/2605.06879#S3.SS3), directly optimizing the loss function in Eq\. \([5](https://arxiv.org/html/2605.06879#S2.E5)\) over discrete amino\-acid sequence space is computationally challenging\. In our experiments, we therefore map peptide sequences into a continuous feature space using amino\-acid chemical property descriptors \(CHEM\) and perform optimization in the resulting continuous domain\. This representation was motivated by prior studies demonstrating correlations between amino\-acid chemical properties and influenza viral protein functions\. For consistency, we also applied the same representation to the RSV and SARS\-CoV\-2 tasks\.

However, the Evo\-PU framework is not restricted to this specific representation\. In this section, we investigate the performance of Evo\-PU when coupled with sequence representations extracted from the pretrained ESM2 protein language model \[Lin et al\.,[2023](https://arxiv.org/html/2605.06879#biba.bib19)\]\. Specifically, we use the `esm2_t30_150M_UR50D` model, which produces a fixed\-length 640\-dimensional representation for each sequence\. We consider all single\-organism tasks, including Influenza\-Fusion, Influenza\-Binding, Influenza\-Evasion, RSV\-HRC, SARS\-CoV\-2\-Fusion, and SARS\-CoV\-2\-Emergence\.

The AUC metric for all problems except SARS\-CoV\-2\-Emergence, together with Spearman\-$\rho$ for SARS\-CoV\-2\-Emergence, is reported in Figure [A7](https://arxiv.org/html/2605.06879#A8.F7)\. The AP metric for all problems except SARS\-CoV\-2\-Emergence is reported in Figure [A8](https://arxiv.org/html/2605.06879#A8.F8)\.

Overall, the results show that Evo\-PU with the ESM2 representation consistently yields lower performance across nearly all tasks and evaluation metrics compared to the CHEM representation\. We hypothesize that this degradation is primarily due to the substantially higher dimensionality of the ESM2 embedding space combined with the relatively limited amount of training data available in these biological tasks, which may lead to more difficult optimization and increased risk of overfitting\.

Despite the observed performance degradation, this study highlights the importance of investigating task\-specific protein representations within the Evo\-PU framework\. Furthermore, representations derived from protein language models provide fixed\-dimensional embeddings regardless of sequence length, making them naturally compatible with insertions and deletions\. In contrast, the handcrafted CHEM representation would require additional modifications to accommodate variable\-length sequences\. Consequently, integrating protein language model representations into Evo\-PU remains a promising direction for future work\.

![Refer to caption](https://arxiv.org/html/2605.06879v1/x9.png)

Figure A7: Performance metrics of Evo\-PU under different protein sequence representations on the following problems: \(a\) Influenza\-Fusion, \(b\) Influenza\-Binding, \(c\) Influenza\-Evasion, \(d\) RSV\-HRC, \(e\) SARS\-CoV\-2\-Fusion, and \(f\) SARS\-CoV\-2\-Emergence\. The considered sequence representations are the handcrafted amino\-acid chemical property representation \(CHEM\) and the 640\-dimensional embedding extracted from the pretrained `esm2_t30_150M_UR50D` protein language model \(ESM2\)\. Circle markers denote the LR classifier, while square markers denote the WD classifier within the Evo\-PU framework\. For all tasks except SARS\-CoV\-2\-Emergence, the y\-axis reports the AUC metric\. For SARS\-CoV\-2\-Emergence, the y\-axis reports the Spearman\-$\rho$ correlation between predicted scores and observed emergence frequencies\. Blue markers indicate the configuration reported in the main paper\. Overall, Evo\-PU with the CHEM representation consistently outperforms the ESM2 representation across most tasks and evaluation metrics\.

![Refer to caption](https://arxiv.org/html/2605.06879v1/x10.png)

Figure A8: Average precision \(AP\) performance of Evo\-PU under different protein sequence representations on the following problems: \(a\) Influenza\-Fusion, \(b\) Influenza\-Binding, \(c\) Influenza\-Evasion, \(d\) RSV\-HRC, and \(e\) SARS\-CoV\-2\-Fusion\. The considered sequence representations are the handcrafted amino\-acid chemical property representation \(CHEM\) and the 640\-dimensional embedding extracted from the pretrained `esm2_t30_150M_UR50D` protein language model \(ESM2\)\. Circle markers denote the LR classifier, while square markers denote the WD classifier within the Evo\-PU framework\. The y\-axis reports the AP metric\. Blue markers indicate the configuration reported in the main paper\. Overall, Evo\-PU with the CHEM representation consistently achieves higher AP performance than the ESM2 representation across most tasks\.

## Appendix I Details of baseline methods

### I\.1 PU\-learning methods

2Step: In 2Step \[Bekker and Davis,[2020](https://arxiv.org/html/2605.06879#biba.bib1)\], 20% of the positive samples are randomly selected and inserted into the unlabeled set as “spies\.” These spies and the unlabeled data are temporarily treated as negatives, while the remaining 80% of the positives are used as labeled positives\. A primary classifier is trained on this combined dataset \(spies \+ unlabeled as negatives, remaining positives as positives\)\. After training, the primary model assigns a probability of being positive to each sequence\. The lowest probability among all spy sequences is used as a threshold: any unlabeled sequence with a lower score than this threshold is labeled a reliable negative\. The final classifier is then trained using these reliable negatives and all original positives, and is used for final prediction\.
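
The spy step described above can be sketched as follows. The primary classifier is left abstract: the sketch assumes that its scores for the spies and for the unlabeled sequences are already available. The function names and toy values are hypothetical.

```python
import random


def split_spies(positives, spy_fraction=0.2, seed=0):
    """Randomly move a fraction of the positives into the spy set.

    Returns (labeled_positives, spies).
    """
    rng = random.Random(seed)
    shuffled = list(positives)
    rng.shuffle(shuffled)
    n_spies = max(1, round(spy_fraction * len(shuffled)))
    return shuffled[n_spies:], shuffled[:n_spies]


def reliable_negatives(unlabeled, unlabeled_scores, spy_scores):
    """Keep unlabeled items scoring strictly below the lowest spy score."""
    threshold = min(spy_scores)
    return [x for x, s in zip(unlabeled, unlabeled_scores) if s < threshold]


positives = [f"pos{i}" for i in range(10)]  # toy identifiers
labeled, spies = split_spies(positives)

# Toy scores standing in for the primary classifier's predicted probabilities.
unlabeled = ["u1", "u2", "u3"]
negatives = reliable_negatives(unlabeled, [0.05, 0.40, 0.20], spy_scores=[0.30, 0.60])
print(len(labeled), len(spies), negatives)
```

The final classifier would then be trained on `negatives` together with all original positives, as the paragraph above describes.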

Protein\-PU: Protein\-PU \[Song et al\.,[2021](https://arxiv.org/html/2605.06879#biba.bib38)\] was originally proposed to train classifiers on deep mutational scanning \(DMS\) datasets\. In its original formulation, Protein\-PU assumes the existence of an initial sequence library, which is first sequenced with a risk of incomplete detection; the authors therefore assume that the initially observed library is a subset of all experimentally generated sequences\. This library is then subjected to a selection assay, after which only positive \(functional\) sequences are sequenced, again with imperfect detection\. As a result, positive sequences may be observed from two sources \(both before and after selection\), whereas negative sequences can only be observed from the initial library\.

The original method models this asymmetry using a logistic regression classifier with a bias term reflecting the differing observation mechanisms\. In our setting, since we do not explicitly work with DMS data, we adopt a simplified assumption: the observed positive and unlabeled sequences constitute the entire dataset\. Under this assumption, the likelihood reduces to the form presented in Section [2\.2](https://arxiv.org/html/2605.06879#S2.SS2), with a tunable detection efficiency \(or class prior\) parameter $q$\.

Let $\pi$ denote the fraction of positive sequences in the full dataset\. The detection efficiency $q$ is given by

$$q=\frac{n_{\text{obs}}}{\pi\left(n_{\text{obs}}+n_{\text{unlabeled}}\right)},$$

where $n_{\text{obs}}$ is the number of observed positive sequences and $n_{\text{unlabeled}}$ is the number of unlabeled sequences\.

We perform a grid search over $\pi$, starting from

$$\pi=\min\left(2\,\frac{n_{\text{obs}}}{n_{\text{obs}}+n_{\text{unlabeled}}},\;0.5\right)$$

up to $\pi=1$ with step size 0\.1\. To select the optimal $\pi$, we follow the procedure proposed in the original work\. During training, observed positive sequences are treated as positive, while unlabeled sequences are treated as negative\. We then perform 10\-fold cross\-validation\. For each fold, we train the classifier using the custom loss with the specified $\pi$ \(and corresponding $q$\) on the remaining nine folds, and evaluate it on the held\-out fold to compute the AUC, denoted as AUC\-PU\.

Since the unlabeled sequences in the held\-out fold are not true negatives, we compute the corrected AUC \[Jain et al\.,[2017](https://arxiv.org/html/2605.06879#biba.bib14)\]:

$$\text{Corrected-AUC}=\frac{\text{AUC-PU}-\pi/2}{1-\pi}.$$

We select the value of $\pi$ that maximizes the average corrected AUC across the 10 folds\. Finally, we retrain the classifier on the full dataset using this selected $\pi$ and evaluate it on the test set\.
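
The grid and the selection rule can be sketched as below. Two points are assumptions of this sketch rather than statements from the text: the endpoint $\pi=1$ is skipped during selection because the correction divides by $1-\pi$, and the per\-$\pi$ input is the mean AUC\-PU over folds (for a fixed $\pi$ the correction is linear in AUC\-PU, so correcting the mean equals averaging the corrected values). All names and numbers are hypothetical.

```python
def pi_grid(n_obs, n_unlabeled, step=0.1):
    """Grid from min(2 * n_obs / (n_obs + n_unlabeled), 0.5) up to 1."""
    start = min(2 * n_obs / (n_obs + n_unlabeled), 0.5)
    grid, i = [], 0
    while start + i * step <= 1.0 + 1e-9:  # tolerance for float accumulation
        grid.append(round(start + i * step, 10))
        i += 1
    return grid


def corrected_auc(auc_pu, pi):
    """Corrected AUC = (AUC-PU - pi / 2) / (1 - pi); undefined at pi = 1."""
    return (auc_pu - pi / 2) / (1 - pi)


def select_pi(mean_auc_pu_by_pi):
    """Pick the pi with the highest corrected AUC, skipping pi = 1 (assumption)."""
    scored = {
        pi: corrected_auc(auc, pi)
        for pi, auc in mean_auc_pu_by_pi.items()
        if pi < 1.0
    }
    return max(scored, key=scored.get)


grid = pi_grid(n_obs=100, n_unlabeled=900)  # illustrative counts
best_pi = select_pi({0.2: 0.60, 0.5: 0.70, 1.0: 0.90})  # toy mean AUC-PU values
print(grid, best_pi)
```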

### I\.2 OCC methods

For these OCC baselines, we do not incorporate any of the generated sequences\. The models are trained using only positive observed sequences for the influenza and SARS\-CoV\-2 tasks and only the provided MSA sequences for the ProteinGym benchmarks\.

OC\-SVM: Standard OC\-SVM \[Schölkopf et al\.,[2001](https://arxiv.org/html/2605.06879#biba.bib35)\] learns a hyperplane separating the training data from the origin\.

iForest: iForest \[Liu et al\.,[2008](https://arxiv.org/html/2605.06879#biba.bib22)\] scores anomalies based on the number of splits needed to isolate them\.

### I\.3 Protein language model\-based methods

EVE: For EVE \[Frazer et al\.,[2021](https://arxiv.org/html/2605.06879#biba.bib10)\], we follow the procedures described in the original paper\. For the influenza and SARS\-CoV\-2 tasks, we first choose the most frequently observed sequence as the wild type, retrieve similar sequences from the UniRef90 database, construct an MSA, and create the training set by concatenating the relevant MSA segments\. For the ProteinGym problems, we directly use the curated MSA datasets provided by the benchmark\. We then train a variational autoencoder \(VAE\) on the one\-hot encoded MSA sequences and use it to compute an evolutionary index for each test sequence relative to the wild type\. These indices are subsequently modeled with a two\-component Gaussian mixture model \(GMM\) to predict the functional class of each sequence\.

Zero\-shot: For the influenza and SARS\-CoV\-2 tasks, we use the most frequently observed sequences as wild\-type references and compute the fitness likelihood difference between each test sequence and the wild type using the ESM\-1v model, following Eq\. \(1\) in \[Meier et al\.,[2021](https://arxiv.org/html/2605.06879#biba.bib28)\]\. For the ProteinGym benchmarks, we use the provided wild\-type sequences and follow the same likelihood computation\.
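
The likelihood difference can be illustrated with a short sketch. It assumes that the cited Eq\. \(1\) is the masked\-marginal log\-odds score, i\.e\. a sum over mutated positions of the log\-probability of the mutant residue minus that of the wild\-type residue. The language model itself is not called here: `log_probs` is a hypothetical, precomputed list holding one dictionary of per\-residue log\-probabilities for each position.

```python
import math


def masked_marginal_score(wild_type, mutant, log_probs):
    """Sum of log p(mutant residue) - log p(wild-type residue) over mutated sites.

    `log_probs[t][aa]` is assumed to be the model's log-probability of residue
    `aa` at position `t` (hypothetical precomputed input).
    """
    if len(wild_type) != len(mutant):
        raise ValueError("substitution-only scoring needs equal-length sequences")
    score = 0.0
    for t, (wt, mt) in enumerate(zip(wild_type, mutant)):
        if wt != mt:
            score += log_probs[t][mt] - log_probs[t][wt]
    return score


# Toy two-residue example with made-up probabilities.
toy_log_probs = [
    {"M": math.log(0.9), "L": math.log(0.1)},
    {"A": math.log(0.5), "V": math.log(0.25)},
]
score = masked_marginal_score("MA", "MV", toy_log_probs)
print(score)
```

A negative score means the model considers the variant less likely than the wild type at the mutated sites; sequences identical to the wild type score zero.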

Similarity\-based method \(kNN\-ESM2\): In this baseline, we first embed all sequences \(positive, generated unlabeled, and test sequences\) into a latent space using the ESM2 protein language model \[Lin et al\.,[2023](https://arxiv.org/html/2605.06879#biba.bib19)\]\. We then train a $k$\-nearest neighbors \(kNN\) classifier for prediction\. We consider values of $k$ ranging from 2 to 10\. During training, positive sequences are treated as positive, while unlabeled sequences are treated as negative\. For each choice of $k$, we perform 10\-fold cross\-validation and select the best $k$ based on the average AUC\. The final model is trained using the selected $k$ on the full set of positive and unlabeled sequences and evaluated on the test set\. A similar implementation has been considered, for example, in \[Esmaili et al\.,[2025](https://arxiv.org/html/2605.06879#biba.bib9)\]\.

## Appendix J Optimization details

We implement Evo\-PU and all PU\-learning baselines in PyTorch \[Paszke et al\.,[2019](https://arxiv.org/html/2605.06879#biba.bib32)\]\. For all methods that require optimizing a loss function, we use the Adam optimizer with an initial learning rate of $10^{-3}$\. To improve convergence and avoid poor local minima, we employ a cyclic learning rate schedule via `CyclicLR` in PyTorch, where the learning rate oscillates between $10^{-3}$ and $10^{-1}$ under a triangular policy with a step size of 50 iterations\. During training, gradients are clipped to a maximum norm of 1\.0 to ensure numerical stability\. Optimization is performed for up to 2000 epochs, with early stopping based on the training loss: if the loss does not improve by at least $10^{-6}$ for 100 consecutive epochs, training is terminated\.
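
To make the schedule and stopping rule concrete, the sketch below re\-implements both in plain Python. It is intended to mirror the triangular policy and the patience rule described above, not to reproduce the authors' training loop; the function and class names are hypothetical, and gradient clipping is not covered.

```python
import math


def triangular_lr(iteration, base_lr=1e-3, max_lr=1e-1, step_size=50):
    """Triangular cyclic learning rate: base -> max over `step_size` iterations,
    then back to base over the next `step_size` iterations."""
    cycle = math.floor(1 + iteration / (2 * step_size))
    x = abs(iteration / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x)


class EarlyStopping:
    """Stop when the loss has not improved by `min_delta` for `patience` epochs."""

    def __init__(self, min_delta=1e-6, patience=100):
        self.min_delta = min_delta
        self.patience = patience
        self.best = math.inf
        self.stale = 0

    def step(self, loss):
        """Record one epoch's training loss; return True if training should stop."""
        if self.best - loss >= self.min_delta:
            self.best = loss
            self.stale = 0
        else:
            self.stale += 1
        return self.stale >= self.patience


print([triangular_lr(i) for i in (0, 25, 50, 100)])

stopper = EarlyStopping(patience=3)  # short patience only for this toy run
decisions = [stopper.step(loss) for loss in (1.0, 0.5, 0.5, 0.5, 0.5)]
print(decisions)
```

In an actual PyTorch loop the same roles would be played by `torch.optim.lr_scheduler.CyclicLR` and `torch.nn.utils.clip_grad_norm_`, with the stopping check run once per epoch on the training loss.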

For Evo\-PU, the bounds of $\alpha$ are set to $(0.00075, 0.99)$ for the influenza fusion, RSV\-HRC, and two ProteinGym tasks, $(0.00025, 0.99)$ for influenza binding, $(0.0001, 0.99)$ for influenza evasion, and $(0.008, 0.99)$ for the two SARS\-CoV\-2 tasks; the bounds of $p_o$ are fixed to $(0.01, 0.99)$ for all tasks\. For Evo\-PU, Protein\-PU, and 2Step, we apply $L_2$ regularization with a penalty of 50 during training\.

For Protein\-PU, we generate 10 unlabeled datasets and report average metric values and errors across them\. For 2Step, we use the same 10 unlabeled datasets as in Protein\-PU; for each dataset, we run 10 independent trials with different spy assignments and report average metric values with error bars across all runs\.

For OC\-SVM, iForest and k\-NN \(with ESM2 representation\), we use the Scikit\-learn implementations\[Pedregosa et al\.,[2011](https://arxiv.org/html/2605.06879#biba.bib33)\]\. The EVE model is run using the official implementation:[https://github\.com/OATML\-Markslab/EVE](https://github.com/OATML-Markslab/EVE)

## Appendix K Complementary Performance Metric: Average Precision \(AP\)

In this section, we report the average precision \(AP\) as a complementary metric to the AUC results presented in the main paper\. The AP performance for all single\-organism tasks, including \(a\) Influenza\-Fusion, \(b\) Influenza\-Binding, \(c\) Influenza\-Evasion, \(d\) RSV\-HRC, and \(e\) SARS\-CoV\-2\-Fusion, is shown in Figure [A9](https://arxiv.org/html/2605.06879#A11.F9)\. The AP performance for the two ProteinGym datasets is reported in Figure [A10](https://arxiv.org/html/2605.06879#A11.F10)\.

In all plots, circle markers denote the LR classifier, while square markers denote the WD classifier\. Diamond markers represent methods with their own model\-specific classifiers\. The best\-performing method for each task is highlighted in red\. For Protein\-PU and 2Step, error bars indicate variability across runs\.

Overall, Evo\-PU achieves strong performance under the AP metric across most single\-organism tasks, outperforming competing methods in the majority of cases, while remaining competitive on Influenza\-Fusion and Influenza\-Evasion\. In contrast, for the ProteinGym benchmarks, PLM\-based methods consistently achieve the best performance, aligning with the observations reported in the main paper\.

![Refer to caption](https://arxiv.org/html/2605.06879v1/x11.png)

Figure A9: Average precision \(AP\) performance of all methods on single\-organism tasks including \(a\) Influenza\-Fusion, \(b\) Influenza\-Binding, \(c\) Influenza\-Evasion, \(d\) RSV\-HRC, and \(e\) SARS\-CoV\-2\-Fusion\. Circle markers denote the LR classifier, while square markers denote the WD classifier\. Diamond markers represent methods with their own model\-specific classifiers\. The best\-performing method for each task is highlighted in red\. Error bars for Protein\-PU and 2Step indicate variability across runs\. Overall, Evo\-PU achieves strong performance across most tasks under the AP metric\.

![Refer to caption](https://arxiv.org/html/2605.06879v1/x12.png)

Figure A10: Average precision \(AP\) performance of all methods on ProteinGym benchmarks \(a\) ProteinGym\-A0 and \(b\) ProteinGym\-PSAE\. Circle markers denote the LR classifier, while square markers denote the WD classifier\. Diamond markers represent methods with their own model\-specific classifiers\. The best\-performing method for each task is highlighted in red\. Error bars for Protein\-PU and 2Step indicate variability across runs\. PLM\-based methods achieve the strongest performance on these multi\-organism benchmarks\.

## References for the appendix

- Bekker and Davis \[2020\] Jessa Bekker and Jesse Davis\. Learning from positive and unlabeled data: A survey\. *Machine Learning*, 109\(4\):719–760, 2020\.
- Bermingham et al\. \[2018\] Imogen M Bermingham, Keith J Chappell, Daniel Watterson, and Paul R Young\. The heptad repeat C domain of the respiratory syncytial virus fusion protein plays a key role in membrane fusion\. *Journal of Virology*, 92\(4\):10–1128, 2018\.
- Chalapathy \[2018\] R Chalapathy\. Anomaly detection using one\-class neural networks\. *arXiv preprint arXiv:1802\.06360*, 2018\.
- Chen et al\. \[2012\] Li\-Mei Chen, Ola Blixt, James Stevens, Aleksandr S Lipatov, Charles T Davis, Brian E Collins, Nancy J Cox, James C Paulson, and Ruben O Donis\. In vitro evolution of H5N1 avian influenza virus toward human\-type receptor specificity\. *Virology*, 422\(1\):105–113, 2012\.
- Cheng et al\. \[2016\] Heng\-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, Hrishi Aradhye, Glen Anderson, Greg Corrado, Wei Chai, Mustafa Ispir, et al\. Wide & deep learning for recommender systems\. In *Proceedings of the 1st Workshop on Deep Learning for Recommender Systems*, pages 7–10, 2016\.
- Cross et al\. \[2009\] Karen J Cross, William A Langley, Rupert J Russell, John J Skehel, and David A Steinhauer\. Composition and functions of the influenza fusion peptide\. *Protein and Peptide Letters*, 16\(7\):766–778, 2009\.
- Désir et al\. \[2012\] Chesner Désir, Simon Bernard, Caroline Petitjean, and Laurent Heutte\. A random forest based approach for one class classification in medical imaging\. In *Machine Learning in Medical Imaging: Third International Workshop, MLMI 2012, Held in Conjunction with MICCAI 2012, Nice, France, October 1, 2012, Revised Selected Papers 3*, pages 250–257\. Springer, 2012\.
- Erfani et al\. \[2016\] Sarah M Erfani, Sutharshan Rajasegarar, Shanika Karunasekera, and Christopher Leckie\. High\-dimensional and large\-scale anomaly detection using a linear one\-class SVM with deep learning\. *Pattern Recognition*, 58:121–134, 2016\.
- Esmaili et al\. \[2025\] Farzaneh Esmaili, Yongfang Qin, Duolin Wang, and Dong Xu\. Kinase\-substrate prediction using an autoregressive model\. *Computational and Structural Biotechnology Journal*, 27:1103–1111, 2025\.
- Frazer et al\. \[2021\] Jonathan Frazer, Pascal Notin, Mafalda Dias, Aidan Gomez, Joseph K Min, Kelly Brock, Yarin Gal, and Debora S Marks\. Disease variant prediction with deep generative models of evolutionary data\. *Nature*, 599\(7883\):91–95, 2021\.
- Ghafoori and Leckie \[2020\] Zahra Ghafoori and Christopher Leckie\. Deep multi\-sphere support vector data description\. In *Proceedings of the 2020 SIAM International Conference on Data Mining*, pages 109–117\. SIAM, 2020\.
- Han et al\. \[1999\] Xing Han, David A Steinhauer, Stephen A Wharton, and Lukas K Tamm\. Interaction of mutant influenza virus hemagglutinin fusion peptides with lipid bilayers: probing the role of hydrophobic residue size in the central region of the fusion peptide\. *Biochemistry*, 38\(45\):15052–15059, 1999\.
- Hicks et al\. \[2018\] Stephanie N Hicks, Supranee Chaiwatpongsakorn, Heather M Costello, Jason S McLellan, William Ray, and Mark E Peeples\. Five residues in the apical loop of the respiratory syncytial virus fusion protein F2 subunit are critical for its fusion activity\. *Journal of Virology*, 92\(15\):10–1128, 2018\.
- Jain et al\. \[2017\] Shantanu Jain, Martha White, and Predrag Radivojac\. Recovering true classifier performance in positive\-unlabeled learning\. In *Proceedings of the AAAI Conference on Artificial Intelligence*, volume 31, 2017\.
- Khan and Madden \[2014\] Shehroz S Khan and Michael G Madden\. One\-class classification: taxonomy of study and review of techniques\. *The Knowledge Engineering Review*, 29\(3\):345–374, 2014\.
- Lai et al\. \[2006\] Alex L Lai, Heather Park, Judith M White, and Lukas K Tamm\. Fusion peptide of influenza hemagglutinin requires a fixed angle boomerang structure for activity\. *Journal of Biological Chemistry*, 281\(9\):5760–5770, 2006\.
- Lee and Liu \[2003\] Wee Sun Lee and Bing Liu\. Learning with positive and unlabeled examples using weighted logistic regression\. In *ICML*, volume 3, pages 448–455, 2003\.
- Li et al\. \[2010\] Xiao\-Li Li, Bing Liu, and See Kiong Ng\. Negative training data can be harmful to text classification\. In *Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing*, pages 218–228, 2010\.
- Lin et al\. \[2023\] Zeming Lin, Halil Akin, Roshan Rao, Brian Hie, Zhongkai Zhu, Wenting Lu, Nikita Smetanin, Robert Verkuil, Ori Kabeli, Yaniv Shmueli, et al\. Evolutionary\-scale prediction of atomic\-level protein structure with a language model\. *Science*, 379\(6637\):1123–1130, 2023\.
- Liu et al\. \[2002\] Bing Liu, Wee Sun Lee, Philip S Yu, and Xiaoli Li\. Partially supervised classification of text documents\. In *ICML*, volume 2, pages 387–394\. Sydney, NSW, 2002\.
- Liu et al\. \[2003\] Bing Liu, Yang Dai, Xiaoli Li, Wee Sun Lee, and Philip S Yu\. Building text classifiers using positive and unlabeled examples\. In *Third IEEE International Conference on Data Mining*, pages 179–186\. IEEE, 2003\.
- Liu et al\. \[2008\] Fei Tony Liu, Kai Ming Ting, and Zhi\-Hua Zhou\. Isolation forest\. In *2008 Eighth IEEE International Conference on Data Mining*, pages 413–422\. IEEE, 2008\.
- Maines et al\. \[2011\] Taronna R Maines, Li\-Mei Chen, Neal Van Hoeven, Terrence M Tumpey, Ola Blixt, Jessica A Belser, Kortney M Gustin, Melissa B Pearce, Claudia Pappas, James Stevens, et al\. Effect of receptor binding domain mutations on receptor binding and transmissibility of avian influenza H5N1 viruses\. *Virology*, 413\(1\):139–147, 2011\.
- Manevitz and Yousef \[2001\] Larry M Manevitz and Malik Yousef\. One\-class SVMs for document classification\. *Journal of Machine Learning Research*, 2\(Dec\):139–154, 2001\.
- Martín et al\. \[1998\] Javier Martín, Stephen A Wharton, Yi Pu Lin, Darin K Takemoto, John J Skehel, Don C Wiley, and David A Steinhauer\. Studies of the binding properties of influenza hemagglutinin receptor\-site mutants\. *Virology*, 241\(1\):101–111, 1998\.
- Matrosovich et al\. \[2009\] M Matrosovich, J Stech, and H Dieter Klenk\. Influenza receptors, polymerase and host range\. *Revue scientifique et technique*, 28\(1\):203, 2009\.
- Mei and Zhu \[2015\] Suyu Mei and Hao Zhu\. A novel one\-class SVM based negative data sampling method for reconstructing proteome\-wide HTLV\-human protein interaction networks\. *Scientific Reports*, 5\(1\):8034, 2015\.
- Meier et al\. \[2021\] Joshua Meier, Roshan Rao, Robert Verkuil, Jason Liu, Tom Sercu, and Alex Rives\. Language models enable zero\-shot prediction of the effects of mutations on protein function\. *Advances in Neural Information Processing Systems*, 34:29287–29303, 2021\.
- Munroe and Madden \[2005\] Daniel T Munroe and Michael G Madden\. Multi\-class and single\-class classification approaches to vehicle model recognition from images\. *Proc\. AICS*, pages 1–11, 2005\.
- \[30\] NCBI\. Viral surveillance and subtyping interface \(VSSI\)\. [https://www\.ncbi\.nlm\.nih\.gov/labs/virus/vssi/\#/](https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/) \[Accessed: \(2025\-05\-15\)\]\.
- NCBI \[2026\] NCBI\. SARS\-CoV\-2 data hub, 2026\. [https://www\.ncbi\.nlm\.nih\.gov/labs/virus/vssi/\#/virus?SeqType\_s=Nucleotide&VirusLineage\_ss=Severe%20acute%20respiratory%20syndrome%20coronavirus%202,%20taxid:2697049&CollectionDate\_dr=2020\-01\-01T00:00:00\.00Z%20TO%202026\-01\-21T23:59:59\.00Z&HostLineage\_ss=Homo%20sapiens%20\(human\),%20taxid:9606](https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/virus?SeqType_s=Nucleotide&VirusLineage_ss=Severe%20acute%20respiratory%20syndrome%20coronavirus%202,%20taxid:2697049&CollectionDate_dr=2020-01-01T00:00:00.00Z%20TO%202026-01-21T23:59:59.00Z&HostLineage_ss=Homo%20sapiens%20(human),%20taxid:9606)\.
- Paszke et al\. \[2019\] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al\. PyTorch: An imperative style, high\-performance deep learning library\. *Advances in Neural Information Processing Systems*, 32, 2019\.
- Pedregosa et al\. \[2011\] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al\. Scikit\-learn: Machine learning in Python\. *Journal of Machine Learning Research*, 12:2825–2830, 2011\.
- Qiao et al\. \[1999\] Hui Qiao, R Todd Armstrong, Grigory B Melikyan, Fredric S Cohen, and Judith M White\. A specific point mutant at position 1 of the influenza hemagglutinin fusion peptide displays a hemifusion phenotype\. *Molecular Biology of the Cell*, 10\(8\):2759–2769, 1999\.
- Schölkopf et al\. \[2001\] Bernhard Schölkopf, John C Platt, John Shawe\-Taylor, Alex J Smola, and Robert C Williamson\. Estimating the support of a high\-dimensional distribution\. *Neural Computation*, 13\(7\):1443–1471, 2001\.
- Shu and McCauley \[2017\] Yuelong Shu and John McCauley\. GISAID: Global initiative on sharing all influenza data–from vision to reality\. *Eurosurveillance*, 22\(13\):30494, 2017\.
- Skabar \[2003\] Andrew Skabar\. Single\-class classifier learning using neural networks: An application to the prediction of mineral deposits\. In *Proceedings of the 2003 International Conference on Machine Learning and Cybernetics \(IEEE Cat\. No\. 03EX693\)*, volume 4, pages 2127–2132\. IEEE, 2003\.
- Song et al\. \[2021\] Hyebin Song, Bennett J Bremer, Emily C Hinds, Garvesh Raskutti, and Philip A Romero\. Inferring protein sequence\-function relationships with large\-scale positive\-unlabeled learning\. *Cell Systems*, 12\(1\):92–101, 2021\.
- Su et al\. \[2008\] Y Su, Xingguo Zhu, Y Wang, M Wu, and P Tien\. Evaluation of Glu11 and Gly8 of the H5N1 influenza hemagglutinin fusion peptide in membrane fusion using pseudotype virus and reverse genetics\. *Archives of Virology*, 153:247–257, 2008\.
- Tamm et al\. \[2002\] Lukas K Tamm, Xing Han, Yinling Li, and Alex L Lai\. Structure and function of membrane fusion peptides\. *Peptide Science: Original Research on Biomolecules*, 66\(4\):249–260, 2002\.
- Tax and Duin \[1999a\] David MJ Tax and Robert PW Duin\. Data domain description using support vectors\. In *ESANN*, volume 99, pages 251–256, 1999a\.
- Tax and Duin \[1999b\] David MJ Tax and Robert PW Duin\. Support vector domain description\. *Pattern Recognition Letters*, 20\(11\-13\):1191–1199, 1999b\.
- Tax and Duin \[2001\] David MJ Tax and Robert PW Duin\. Uniform object generation for optimizing one\-class classifiers\. *Journal of Machine Learning Research*, 2\(Dec\):155–173, 2001\.
- Thadani et al\. \[2023\] Nicole N Thadani, Sarah Gurev, Pascal Notin, Noor Youssef, Nathan J Rollins, Daniel Ritter, Chris Sander, Yarin Gal, and Debora S Marks\. Learning from prepandemic data to forecast viral escape\. *Nature*, 622\(7984\):818–825, 2023\.
- Wang and Stolfo \[2003\] Ke Wang and Salvatore Stolfo\. One\-class training for masquerade detection\. 2003\.
- Xu et al\. \[2023\] Hongzuo Xu, Guansong Pang, Yijie Wang, and Yongjun Wang\. Deep isolation forest for anomaly detection\. *IEEE Transactions on Knowledge and Data Engineering*, 35\(12\):12591–12604, 2023\.
- Yang et al\. \[2007\] Zhi\-Yong Yang, Chih\-Jen Wei, Wing\-Pui Kong, Lan Wu, Ling Xu, David F Smith, and Gary J Nabel\. Immunization by avian H5 influenza hemagglutinin mutants with altered receptor binding specificity\. *Science*, 317\(5839\):825–828, 2007\.
- Yousef and Charkari \[2015\] Abdulaziz Yousef and Nasrollah Moghadam Charkari\. A novel method based on physicochemical properties of amino acids and one class classification algorithm for disease gene identification\. *Journal of Biomedical Informatics*, 56:300–306, 2015\.
