DeepRHP: A Hybrid Variational Autoencoder for Designing Random Heteropolymers as Protein Mimics
Summary
DeepRHP is a hybrid variational autoencoder that guides the design of random heteropolymers as protein mimics, demonstrated by stabilizing membrane proteins like Aquaporin Z in non-native environments.
# DeepRHP: A Hybrid Variational Autoencoder for Designing Random Heteropolymers as Protein Mimics
Source: [https://arxiv.org/html/2606.11651](https://arxiv.org/html/2606.11651)
###### Abstract
Synthetic random heteropolymers \(RHPs\), consisting of a predefined set of monomers, offer an approach toward the design of protein\-like materials\. These RHPs, if designed appropriately, can mimic protein behavior and function\. As such, there is a need for computational tools to efficiently guide RHP design\. We bridge this gap by developing DeepRHP, a modified variational autoencoder \(VAE\) model under a semi\-supervised framework\. By equipping a classical VAE with an additional feature\-based VAE, DeepRHP forces the latent space to capture structures of critical chemical features as well as individual RHP sequence patterns\. In this sense, our method is versatile by allowing any relevant features to be incorporated in a hybrid manner\. We demonstrate the effectiveness of DeepRHP by suggesting potential monomer compositions that stabilize membrane proteins \(e\.g\. Aquaporin Z\) in non\-native environments and cross\-validating our prediction with published results\. The concordance between our model and true RHP function suggests strong potential in utilizing hybrid autoencoder architectures to guide RHP design for proteins and other biological compounds\.
## 1 Introduction
There is significant interest in engineering synthetic materials capable of replicating protein functions while remaining stable and compatible with device fabrication and integration\. However, synthesizing sequence\-specific polymers remains a formidable challenge\. This has led to a recent surge of research in designing protein\-like random heteropolymers\. Random heteropolymers \(RHPs\) are ensembles of many polymer chains, each composed of monomers arranged in random order \(Hilburg et al\. [2020](https://arxiv.org/html/2606.11651#bib.bib1)\)\. Recent developments have demonstrated that RHPs can act as chaperone proteins for protein stabilization in non\-biological environments \(Panganiban et al\. [2018](https://arxiv.org/html/2606.11651#bib.bib2)\), a critical bottleneck in fabricating protein\-embedded plastics for end\-of\-life plastic degradation \(DelRe et al\. [2021](https://arxiv.org/html/2606.11651#bib.bib13)\)\. In addition, RHPs can be designed to act as channel proteins for rapid and selective proton transport \(Jiang et al\. [2020](https://arxiv.org/html/2606.11651#bib.bib3)\), which is important for fuel cells and energy storage\.
Although RHPs can serve as effective biofunctional materials, designing RHPs with a desired function is challenging because neither the exact monomer sequences nor the conformations of synthetic RHP chains are deterministic\. Traditional protein design methods rely heavily on high\-throughput sequencing data and 3D structures\. For example, directed evolution methods evolve protein function by iteratively mutating a selected protein sequence \(Arnold [2018](https://arxiv.org/html/2606.11651#bib.bib16)\), while de novo methods build novel proteins that fold into a certain structure \(Huang et al\. [2016](https://arxiv.org/html/2606.11651#bib.bib17)\)\. Without exact sequences and structures, there are no rational design principles for creating suitably functional RHP chains\. Current RHP designs are largely empirical and depend on time\-intensive lab screenings over various monomer compositions and chain lengths\. For each RHP made in the lab, ensembles of thousands of sequences are simulated under the same monomer composition in order to understand why certain compositions perform better than others\. In this process, scientists face two practical design questions whose answers could accelerate progress:
- How many monomers should be included in an RHP system? Recent results show that RHPs can mimic protein function with only four monomers \(Panganiban et al\. [2018](https://arxiv.org/html/2606.11651#bib.bib2); Jiang et al\. [2020](https://arxiv.org/html/2606.11651#bib.bib3)\), but it remains unclear how many monomers are enough to include in the alphabet\.
- How can one find monomer compositions corresponding to specific protein functions?
Answering these questions requires new methods to model and analyze RHP sequences as an ensemble instead of as individual chains\. To our knowledge, there is very limited literature on computational methods for modeling RHPs\. We are aware of only two examples: [Zhou et al\.](https://arxiv.org/html/2606.11651#bib.bib4) \([2022](https://arxiv.org/html/2606.11651#bib.bib4)\) used hidden Markov models to characterize the functionality of proton\-transporting RHPs, and [Tamasi et al\.](https://arxiv.org/html/2606.11651#bib.bib14) \([2022](https://arxiv.org/html/2606.11651#bib.bib14)\) utilized Gaussian process regression coupled with Bayesian optimization for optimal copolymer identification\.
Here we propose DeepRHP, a modified variational autoencoder trained in a semi\-supervised manner, for modeling general RHP sequence data and discovering RHP compositions for protein function\. This tool serves as a first step that can guide RHP design by examining their protein\-mimicking behavior\. The key contributions of this study are:
- We are the first to answer RHP design questions with deep learning\. DeepRHP learns interpretable latent representations for RHP sequences and provides a platform to perform similarity analysis between target proteins and RHP sequences in an ensemble\.
- DeepRHP provides insights into the two important design parameters: monomer alphabet size and monomer composition\. We show that the best monomer composition suggested by DeepRHP matches published experimental results\.
- DeepRHP is flexible enough to incorporate any function\-related chemical features for a wide variety of protein functions\.
VAE\-based architectures are among the first model classes used to identify latent representations for biological sequences, and are useful in downstream tasks such as identifying mutation effects \(Sinai et al\. [2017](https://arxiv.org/html/2606.11651#bib.bib7); Riesselman et al\. [2018](https://arxiv.org/html/2606.11651#bib.bib8)\) and designing novel functional proteins \(Greener et al\. [2018](https://arxiv.org/html/2606.11651#bib.bib10); Costello and Martin [2019](https://arxiv.org/html/2606.11651#bib.bib9)\)\. We therefore expect the same machine learning approach to transfer to macromolecular cheminformatics, specifically to using RHPs to mimic natural biopolymers\.
Figure 1: DeepRHP model architecture consisting of a classical VAE equipped with an additional feature\-based VAE\.
## 2 Data
Our work utilizes the RHP system developed in both [Panganiban et al\.](https://arxiv.org/html/2606.11651#bib.bib2) \([2018](https://arxiv.org/html/2606.11651#bib.bib2)\) and [Jiang et al\.](https://arxiv.org/html/2606.11651#bib.bib3) \([2020](https://arxiv.org/html/2606.11651#bib.bib3)\)\. This system consists of four methacrylate\-based monomers: methyl methacrylate \(MMA\), 2\-ethylhexyl methacrylate \(EHMA\), oligo\(ethylene glycol\) methacrylate \(OEGMA\), and 3\-sulfopropyl methacrylate potassium salt \(SPMA\)\. MMA and EHMA are the hydrophobic monomers used to tailor overall hydrophobicity, while OEGMA and SPMA are the hydrophilic monomers used to reduce the aggregation propensity of RHPs\.
We used Compositional Drift, a software package developed by [Smith et al\.](https://arxiv.org/html/2606.11651#bib.bib11) \([2019](https://arxiv.org/html/2606.11651#bib.bib11)\), to simulate 10,000 sequences per monomer composition listed in Table [1](https://arxiv.org/html/2606.11651#S2.T1)\. This software uses established mathematical copolymer models in tandem with Monte Carlo simulation to calculate RHP sequences based on experimental conditions\. The authors showed that, while each simulated chain is random at the sequence level, it contains characteristic segments that have a well\-defined statistical distribution \(Smith et al\. [2019](https://arxiv.org/html/2606.11651#bib.bib11)\)\. The reasoning behind the monomer compositions for each specific RHP is further discussed in Section 4\.
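To make the idea of an RHP ensemble concrete, the following minimal sketch draws many chains at one fixed composition\. It is not the Compositional Drift algorithm: it samples monomers independently at the target mole fractions and ignores reactivity ratios and compositional drift\. The function names, the chain length of 100, and the ensemble size of 1,000 are assumptions made for illustration; only the 70:30 EHMA:OEGMA ratio of RHP B comes from the text\.

```python
import random
from collections import Counter


def sample_ensemble(composition, chain_length, n_chains, seed=0):
    """Draw n_chains sequences with i.i.d. monomers at the given mole fractions.

    Simplified stand-in: no reactivity ratios and no compositional drift.
    """
    rng = random.Random(seed)
    names = list(composition)
    weights = [composition[name] for name in names]
    return [rng.choices(names, weights=weights, k=chain_length)
            for _ in range(n_chains)]


def ensemble_fraction(ensemble, monomer):
    """Fraction of all monomer units in the ensemble that are `monomer`."""
    counts = Counter(unit for chain in ensemble for unit in chain)
    return counts[monomer] / sum(counts.values())


# RHP B from the text: 70% hydrophobic (EHMA), 30% hydrophilic (OEGMA).
rhp_b = {"EHMA": 0.7, "OEGMA": 0.3}

# Hypothetical chain length and ensemble size (the paper simulates 10,000 chains).
ensemble = sample_ensemble(rhp_b, chain_length=100, n_chains=1000)
print("EHMA fraction in ensemble:", round(ensemble_fraction(ensemble, "EHMA"), 3))
```

Individual chains differ from one another, but the ensemble\-level composition stays close to the target, which is why the analysis treats RHPs as ensembles rather than as single sequences\.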
We also collected 30,000 membrane protein sequences and 30,000 globular protein sequences at a 50% identity threshold from the UniProt database \(UniProt Consortium [2020](https://arxiv.org/html/2606.11651#bib.bib12)\)\. Common pre\-processing procedures were performed, including discarding sequences with uncommon amino acids and lengths\. Each protein was then reduced to its monomer\-equivalent form according to the assignment in Table [2](https://arxiv.org/html/2606.11651#S2.T2)\. Reducing the protein alphabet is not uncommon in protein sequence analysis; see [Liang et al\.](https://arxiv.org/html/2606.11651#bib.bib15) \([2022](https://arxiv.org/html/2606.11651#bib.bib15)\) for a comprehensive review\. Here our reduction rule is based on monomer hydrophobicity and charge\.
Table 1: Two\- and four\-monomer compositions of RHPs used for training
Table 2: Amino acid \(protein\) to monomer \(RHP\) conversion
## 3 DeepRHP Methodology
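The alphabet reduction described above can be sketched as a simple lookup\. The grouping below is a hypothetical assignment by hydrophobicity and charge, chosen only to illustrate the mechanics; it is not the assignment in Table 2, whose contents are not reproduced here\. The names `AA_TO_MONOMER` and `reduce_sequence` are likewise illustrative\.

```python
# Hypothetical amino-acid-to-monomer grouping (NOT the paper's Table 2).
# Groups are chosen by rough hydrophobicity and charge for illustration only.
HYPOTHETICAL_GROUPS = {
    "EHMA": "ILVFMW",   # strongly hydrophobic residues
    "MMA": "AGPCY",     # weakly hydrophobic or small residues
    "OEGMA": "STNQH",   # polar, uncharged residues
    "SPMA": "DEKR",     # charged residues
}

AA_TO_MONOMER = {
    aa: monomer
    for monomer, residues in HYPOTHETICAL_GROUPS.items()
    for aa in residues
}


def reduce_sequence(protein):
    """Map a protein string to its monomer-equivalent list.

    Returns None when the sequence contains a residue outside the 20
    standard amino acids, mirroring the filtering step in the text.
    """
    try:
        return [AA_TO_MONOMER[aa] for aa in protein.upper()]
    except KeyError:
        return None


print(reduce_sequence("MKLVDS"))
print(reduce_sequence("MKXVDS"))  # contains the non-standard code 'X'
```

Returning `None` for sequences with non\-standard residues keeps the filtering decision explicit, so a caller can drop those sequences before training\.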
To address the domain questions raised in Section 1, we developed DeepRHP, a modified variational autoencoder under a semi\-supervised framework for learning low\-dimensional RHP sequence representations\. The model architecture is illustrated in Figure [1](https://arxiv.org/html/2606.11651#S1.F1)\. We assume the sequence family $X$ follows a probability distribution $p(x)$ and that there exists an underlying latent variable $z \sim N(\mu_z, \Sigma_z)$ that captures intrinsic unobserved sequence properties\. For each sequence $x$, there also exists a function\-related feature $y$, which can be considered a deterministic transformation of $x$\. In the application case presented in Section 4, $y$ is the average hydrophilic–lipophilic balance \(HLB\) value of sliding windows along each sequence \(Kyte and Doolittle [1982](https://arxiv.org/html/2606.11651#bib.bib18)\)\. HLB measures local hydrophobicity and solubility distributions and is closely related to RHP functions \(Panganiban et al\. [2018](https://arxiv.org/html/2606.11651#bib.bib2); Jiang et al\. [2020](https://arxiv.org/html/2606.11651#bib.bib3)\)\. The motivation for introducing function\-related chemical features \(e\.g\. HLB\) is to let them guide the formation of the latent space\.
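As an illustration of the feature described above, the sketch below computes sliding\-window average HLB values along one monomer sequence\. The per\-monomer HLB numbers and the window length are placeholders chosen for the example, not values taken from the paper, and `sliding_window_hlb` is a hypothetical helper name\.

```python
# Placeholder HLB values per monomer (assumed for illustration, not from the paper).
HYPOTHETICAL_HLB = {"MMA": 8.0, "EHMA": 5.0, "OEGMA": 12.0, "SPMA": 18.0}


def sliding_window_hlb(sequence, window=3, hlb=HYPOTHETICAL_HLB):
    """Return the mean HLB of every length-`window` segment of `sequence`."""
    values = [hlb[monomer] for monomer in sequence]
    if window < 1 or len(values) < window:
        raise ValueError("window must be between 1 and the sequence length")
    total = sum(values[:window])
    means = [total / window]
    # Slide the window one position at a time, updating the running sum.
    for i in range(window, len(values)):
        total += values[i] - values[i - window]
        means.append(total / window)
    return means


chain = ["EHMA", "EHMA", "OEGMA", "SPMA"]
print(sliding_window_hlb(chain, window=2))
```

A sequence of length $L$ yields $L - w + 1$ window averages for window length $w$, so the feature vector $y$ tracks how hydrophobicity varies locally along the chain rather than only its overall composition\.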
To incorporate a chemical feature $y$ into our VAE model, we add a feature\-driven VAE in parallel with the classical VAE, so that $y$ and $x$ share the common latent variable $z$\. This is equivalent to simultaneously training two VAEs with shared latent embeddings\. The encoder relies only on $x$ because $y$ is a direct transformation of $x$, as indicated by the dashed lines in Figure [1](https://arxiv.org/html/2606.11651#S1.F1)\.
The objective is still to maximize the log\-likelihood $\log p(x)$ given sequence data $X$, as shown in Equation 1:

$$
\log p(x) = \log \int p(x \mid z)\, p(z)\, dz. \tag{1}
$$

Under the regular VAE setting, Equation [1](https://arxiv.org/html/2606.11651#S3.E1) can be bounded below by the well\-known evidence lower bound \(ELBO\) \(Kingma and Welling [2013](https://arxiv.org/html/2606.11651#bib.bib5); Rezende et al\. [2014](https://arxiv.org/html/2606.11651#bib.bib6)\):

$$
\log p(x) \geq \mathbb{E}_{q}\left[\log p(x \mid z)\right] - D_{KL}\left(q(z \mid x) \,\|\, p(z)\right), \tag{2}
$$

where $q$ is the learned approximate posterior, taken from the normal distribution family\. In practice, $q$ and $p$ are parameterized by the encoder and decoder, respectively, and their weights are optimized through gradient descent\.
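For readers who want the intermediate step, the standard derivation of this bound from the VAE literature \(it is not specific to DeepRHP\) multiplies and divides the integrand by $q(z \mid x)$ and then applies Jensen's inequality to the concave logarithm:

$$
\begin{aligned}
\log p(x) &= \log \int q(z \mid x)\, \frac{p(x \mid z)\, p(z)}{q(z \mid x)}\, dz \\
&\geq \int q(z \mid x)\, \log \frac{p(x \mid z)\, p(z)}{q(z \mid x)}\, dz \\
&= \mathbb{E}_{q}\left[\log p(x \mid z)\right] - D_{KL}\left(q(z \mid x) \,\|\, p(z)\right).
\end{aligned}
$$

The first term rewards accurate reconstruction of $x$ from $z$, while the KL term keeps the approximate posterior close to the prior\.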
Traditionally, the reconstruction loss term is approximated by mean\-squared error for continuous input or by cross\-entropy loss for discrete input\. With this hybrid architecture, we can approximate the reconstruction loss through the classical VAE on $x$, the feature\-driven VAE on $y$, or a weighted sum of both\. Our modified ELBO, which considers both sequence structure and chemical features, is then formulated as

$$
\begin{aligned}
\log p(x) \geq{} & \alpha\, \mathbb{E}_{q}\left[\log p(x \mid z)\right] + (1 - \alpha)\, \mathbb{E}_{q}\left[\log p(y \mid z)\right] \\
& - D_{KL}\left(q(z \mid x) \,\|\, p(z)\right),
\end{aligned} \tag{3}
$$

where $\alpha$ is a hyperparameter that dictates how much weight is placed on each approximation term\. In our case, the first two terms of Equation [3](https://arxiv.org/html/2606.11651#S3.Ex1) are approximated as follows:

$$
\mathbb{E}_{q}\left[\log p(x \mid z)\right] \approx \sum_{x} \sum_{l} p(x_{l}) \log p(x_{l} \mid z), \tag{4}
$$

$$
\mathbb{E}_{q}\left[\log p(y \mid z)\right] \approx -\sum_{y} \|y - y^{\prime}\|_{2}^{2}, \tag{5}
$$

where $y^{\prime}$ is the output of the feature\-based decoder denoted by the blue shading in Figure [1](https://arxiv.org/html/2606.11651#S1.F1)\.
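The following sketch shows how the three terms of the modified ELBO could be combined into a training loss for a single sequence\. It is an illustration under stated assumptions, not the authors' implementation: it assumes a diagonal Gaussian posterior and a standard normal prior \(so the KL term has a closed form\), and the function name `hybrid_loss` and its arguments are hypothetical\.

```python
import math


def hybrid_loss(x_onehot, x_probs, y, y_pred, mu, log_var, alpha=0.5):
    """Negative modified ELBO (Equation 3) for one sequence.

    x_onehot, x_probs: per-position lists over the monomer alphabet
                       (true one-hot labels and decoder probabilities).
    y, y_pred:         true and decoded chemical feature vectors.
    mu, log_var:       encoder outputs for a diagonal Gaussian q(z|x).
    Assumes the prior p(z) is a standard normal (an assumption of this sketch).
    """
    eps = 1e-12  # avoids log(0) for zero-probability entries
    # Equation 4: sequence log-likelihood as negative cross-entropy.
    log_px = sum(
        t * math.log(p + eps)
        for pos_true, pos_prob in zip(x_onehot, x_probs)
        for t, p in zip(pos_true, pos_prob)
    )
    # Equation 5: feature log-likelihood as negative squared error.
    log_py = -sum((a - b) ** 2 for a, b in zip(y, y_pred))
    # Closed-form KL divergence between N(mu, diag(exp(log_var))) and N(0, I).
    kl = 0.5 * sum(m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, log_var))
    elbo = alpha * log_px + (1.0 - alpha) * log_py - kl
    return -elbo


loss = hybrid_loss(
    x_onehot=[[1, 0], [0, 1]],
    x_probs=[[0.9, 0.1], [0.2, 0.8]],
    y=[5.0, 8.5],
    y_pred=[5.5, 8.0],
    mu=[0.1, -0.2],
    log_var=[0.0, 0.0],
    alpha=0.5,
)
print("hybrid loss:", loss)
```

Setting `alpha=1.0` recovers the classical sequence\-only VAE objective, while `alpha=0.0` trains the latent space on the chemical feature alone\.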
By optimizing the reconstruction loss in this hybrid manner, we obtain a meaningful low\-dimensional latent space that captures the sequence structure relevant to the desired protein function\. Additionally, our method offers interpretability benefits that classical VAEs often lack\. Existing works usually concatenate all features into a single vector for the encoder\. The resulting latent space is then obscured, as no physical meaning can be derived for the principal directions\. In contrast, our hybrid training leads to meaningful visualizations of the data because the latent variables are directly linked to the chemical features\.
Both the encoder and the decoder were implemented as multilayer perceptrons in PyTorch\. Each has three fully connected layers with 256, 128, and 64 hidden units, respectively\. The feature decoder has two fully connected layers with 32 hidden units\. ReLU activation functions were used as non\-linearities throughout the network, except in the output layer of the decoder, where a sigmoid activation was used instead\. The model was trained using the Adam optimizer with a learning rate of 0\.0001\. A learning rate scheduler was applied when the validation loss stopped improving\.
## 4 Results and Discussion
Figure 2: PCA projections of RHP and protein latent factors\. Panels \(a\) and \(b\) project membrane and globular proteins onto the two\- and four\-monomer RHP spaces, respectively\. Panels \(c\) and \(d\) project AqpZ onto the same two RHP spaces\.
Aquaporins \(Aqp\) are membrane channel proteins that facilitate water transport between cells\. Membrane proteins are unstable and prone to aggregation even under mild experimental conditions\. [Panganiban et al\.](https://arxiv.org/html/2606.11651#bib.bib2) successfully stabilized Aquaporin Z \(AqpZ\) and preserved its function in non\-native environments in the presence of RHPs\. We demonstrate how DeepRHP can be used to accelerate RHP design by identifying promising monomer compositions\.
[Panganiban et al\.](https://arxiv.org/html/2606.11651#bib.bib2) \([2018](https://arxiv.org/html/2606.11651#bib.bib2)\) chose to use 70% hydrophobic monomers and 30% hydrophilic monomers in their RHP system based on a crude protein surface analysis of four protein sequences\. We first validate this distribution of monomer hydrophobicities using our model\. The latent factors of the two\-monomer RHPs and natural proteins are projected onto a two\-dimensional space using Principal Component Analysis \(PCA\), as shown in Figure [2](https://arxiv.org/html/2606.11651#S4.F2)\(a\)\. All two\-monomer RHPs are composed of one hydrophobic monomer \(EHMA\) and one hydrophilic monomer \(OEGMA\)\. The compositions of RHP A through RHP E listed in Table [1](https://arxiv.org/html/2606.11651#S2.T1) are selected to sufficiently reflect this hydrophobicity range\. We observe that PC1 correlates with hydrophobicity: the RHPs span from least hydrophobic on the left to most hydrophobic on the right\. The majority of membrane and globular proteins overlap with RHP B and RHP C, suggesting these two RHP compositions are most similar to natural proteins\. In particular, most hydrophobic membrane proteins overlap with RHP B \(30% hydrophilic, 70% hydrophobic\), confirming that 30:70 is a good balance for the two\-monomer system\.
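The projection step can be sketched as follows: fit the principal components on the RHP latent factors, then project protein latent factors onto the same axes\. This is a standard\-library illustration using power iteration with deflation; the latent vectors are synthetic random numbers standing in for encoder outputs, and all function names are hypothetical\. A numerical library would normally be used in practice\.

```python
import random


def mean_center(rows):
    """Return the column means and the mean-centered rows."""
    d = len(rows[0])
    mean = [sum(r[j] for r in rows) / len(rows) for j in range(d)]
    return mean, [[r[j] - mean[j] for j in range(d)] for r in rows]


def covariance(centered):
    """Sample covariance matrix of mean-centered rows."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def top_components(cov, k=2, iters=500, seed=0):
    """Leading k eigenvectors of cov via power iteration with deflation."""
    rng = random.Random(seed)
    d = len(cov)
    cov = [row[:] for row in cov]  # work on a copy
    comps = []
    for _ in range(k):
        v = [rng.gauss(0, 1) for _ in range(d)]
        for _ in range(iters):
            w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
            norm = sum(x * x for x in w) ** 0.5
            v = [x / norm for x in w]
        lam = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d)) for i in range(d))
        comps.append(v)
        # Remove the found direction so the next pass finds the next component.
        for i in range(d):
            for j in range(d):
                cov[i][j] -= lam * v[i] * v[j]
    return comps


def project(rows, mean, comps):
    """Project rows onto the components after subtracting the fitted mean."""
    return [[sum((r[j] - mean[j]) * c[j] for j in range(len(mean))) for c in comps]
            for r in rows]


# Synthetic stand-ins for encoder outputs (NOT real DeepRHP latent factors).
rng = random.Random(1)
rhp_latents = [[rng.gauss(0, 3.0), rng.gauss(0, 1.0), rng.gauss(0, 0.1)]
               for _ in range(200)]
protein_latents = [[rng.gauss(1, 1.0), rng.gauss(0, 1.0), rng.gauss(0, 0.1)]
                   for _ in range(50)]

mean, centered = mean_center(rhp_latents)
pcs = top_components(covariance(centered), k=2)
rhp_2d = project(rhp_latents, mean, pcs)
protein_2d = project(protein_latents, mean, pcs)
print("first protein in RHP PCA space:", protein_2d[0])
```

Fitting the axes on the RHP ensemble and reusing them for the proteins is what allows the two point clouds to be compared in one plot\.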
We then fine\-tune the performance of the 30:70 distribution of hydrophilic and hydrophobic monomers by increasing the number of monomers from two to four, as shown in Figure [2](https://arxiv.org/html/2606.11651#S4.F2)\(b\)\. A library of four\-monomer\-based RHPs was designed by varying the MMA:EHMA ratio\. The specific monomer composition is shown in Table [1](https://arxiv.org/html/2606.11651#S2.T1)\. Each of RHP 1 through RHP 7 is still composed of 30% hydrophilic monomers \(OEGMA \+ SPMA\) and 70% hydrophobic monomers \(MMA \+ EHMA\)\.
[Panganiban et al\.](https://arxiv.org/html/2606.11651#bib.bib2) \([2018](https://arxiv.org/html/2606.11651#bib.bib2)\) did not rationalize the choice of four monomers for their design of protein\-like RHPs\. Our approach explains why the two\-monomer alphabet size is insufficient\. In Figure [2](https://arxiv.org/html/2606.11651#S4.F2)\(b\), each of the RHP ensembles can be considered as a subset of RHP B and occupies a much more localized natural protein sequence space with smaller variance\. In Figures [2](https://arxiv.org/html/2606.11651#S4.F2)\(c\) and \(d\), we project AqpZ onto the two\-monomer and four\-monomer PCA spaces, respectively\. In the two\-monomer setting, the RHP B space is much larger than the span of AqpZ\. In the four\-monomer setting, however, the AqpZ projections cover the RHP 4 and RHP 5 spaces almost entirely\. Therefore, we believe the two\-monomer sequence space is too broad with respect to proteins while the four\-monomer sequence space is more localized, offering stability in synthesizing RHPs\.
In addition to providing heuristics regarding the number of monomers, DeepRHP sheds light on the choice of monomer compositions\. In Figure [2](https://arxiv.org/html/2606.11651#S4.F2)\(d\), there is a large overlap between the projected proteins and the RHP 4 and RHP 5 contours\. Wet\-lab experiments in [Panganiban et al\.](https://arxiv.org/html/2606.11651#bib.bib2) \([2018](https://arxiv.org/html/2606.11651#bib.bib2)\) demonstrated that the optimal RHP has the same monomer ratio as that of RHP 4 and is capable of stabilizing AqpZ\. Thus, the overlap between RHPs and AqpZ in the PCA space can reflect their sequence correlation and molecular interactions in aqueous solution\. This indicates that the latent embeddings discovered by DeepRHP are chemically meaningful and play a key role in discovering RHPs that provide strong performance\.
## 5 Conclusion
In this study, we developed DeepRHP, a hybrid variational autoencoder model to guide RHP design\. Our model suggests the feasibility of four\-monomer compositions to stabilize AqpZ, matching the respective wet\-lab experiment\. In ablation studies, our model outperforms a single classical VAE without the additional decoding regressor\.
Overall, DeepRHP holds much promise for integrating deep learning techniques, specifically VAEs, into RHP design\. Hybrid VAE architectures like DeepRHP possess several advantages\. First, they can be trained on any sequence family with variable sequence lengths, and no multiple sequence alignment is needed\. Second, DeepRHP is flexible in its degree of supervision: it can be fully unsupervised when no prior knowledge of RHP subpopulations is available, or semi\-supervised by combining function\-related chemical features with vast amounts of sequence data to improve the interpretability of the latent variables\.
Future work in this direction includes strengthening the quantitative assessment of DeepRHP\. Our model is currently assessed qualitatively and validated against laboratory results\. We hope to improve DeepRHP by developing a quantitative measure of the quality of the latent representations\. For instance, we hope to complete further downstream tasks such as classifying specific membrane proteins and evaluating similarities between each RHP and its target proteins\.
## 6 Acknowledgments
This work is supported by the U\.S\. Department of Defense \(DOD\), Army Research Office, under contract W911NF\-13\-1\-0232 and the National Science Foundation under grant number DGE 2146752\. We thank Yaodong Yu and Peter Bickel for useful discussions and comments on the model formulation\.
## References
- F\. H\. Arnold \(2018\) Directed evolution: bringing new chemistry to life\. Angewandte Chemie \- International Edition 57, pp\. 4143–4148\. External Links: [Document](https://dx.doi.org/10.1002/anie.201708408), ISSN 15213773\.
- Z\. Costello and H\. G\. Martin \(2019\) How to hallucinate functional proteins\. arXiv\. External Links: [Document](https://dx.doi.org/10.48550/ARXIV.1903.00458), [Link](https://arxiv.org/abs/1903.00458)\.
- C\. DelRe, Y\. Jiang, P\. Kang, J\. Kwon, A\. Hall, I\. Jayapurna, Z\. Ruan, L\. Ma, K\. Zolkin, T\. Li, C\. D\. Scown, R\. O\. Ritchie, T\. P\. Russell, and T\. Xu \(2021\) Near\-complete depolymerization of polyesters with nano\-dispersed enzymes\. Nature 592, pp\. 558–563\. External Links: [Document](https://dx.doi.org/10.1038/s41586-021-03408-3), ISSN 1476\-4687, [Link](https://www.nature.com/articles/s41586-021-03408-3)\.
- J\. G\. Greener, L\. Moffat, and D\. T\. Jones \(2018\) Design of metalloproteins and novel protein folds using variational autoencoders\. Scientific Reports 8, pp\. 16189\. External Links: [Document](https://dx.doi.org/10.1038/s41598-018-34533-1), ISSN 2045\-2322, [Link](https://www.nature.com/articles/s41598-018-34533-1)\.
- S\. L\. Hilburg, Z\. Ruan, T\. Xu, and A\. Alexander\-Katz \(2020\) Behavior of protein\-inspired synthetic random heteropolymers\. Macromolecules 53, pp\. 9187–9199\. External Links: [Document](https://dx.doi.org/10.1021/ACS.MACROMOL.0C01886/SUPPL%5FFILE/MA0C01886%5FSI%5F001.PDF), ISSN 15205835, [Link](https://pubs.acs.org/doi/abs/10.1021/acs.macromol.0c01886)\.
- P\. S\. Huang, S\. E\. Boyken, and D\. Baker \(2016\) The coming of age of de novo protein design\. Nature 537, pp\. 320–327\. External Links: [Document](https://dx.doi.org/10.1038/nature19946), ISSN 1476\-4687, [Link](https://www.nature.com/articles/nature19946)\.
- T\. Jiang, A\. Hall, M\. Eres, Z\. Hemmatian, B\. Qiao, Y\. Zhou, Z\. Ruan, A\. D\. Couse, W\. T\. Heller, H\. Huang, M\. O\. de la Cruz, M\. Rolandi, and T\. Xu \(2020\) Single\-chain heteropolymers transport protons selectively and rapidly\. Nature 577, pp\. 216–220\. External Links: [Document](https://dx.doi.org/10.1038/s41586-019-1881-0), ISSN 1476\-4687, [Link](https://www.nature.com/articles/s41586-019-1881-0)\.
- D\. P\. Kingma and M\. Welling \(2013\) Auto\-encoding variational bayes\. arXiv\. External Links: [Document](https://dx.doi.org/10.48550/ARXIV.1312.6114), [Link](https://arxiv.org/abs/1312.6114)\.
- J\. Kyte and R\. F\. Doolittle \(1982\) A simple method for displaying the hydropathic character of a protein\. Journal of Molecular Biology 157, pp\. 105–132\. External Links: [Document](https://dx.doi.org/10.1016/0022-2836%2882%2990515-0), ISSN 00222836\.
- Y\. Liang, S\. Yang, L\. Zheng, H\. Wang, J\. Zhou, S\. Huang, L\. Yang, and Y\. Zuo \(2022\) Research progress of reduced amino acid alphabets in protein analysis and prediction\. Computational and Structural Biotechnology Journal 20, pp\. 3503–3510\. External Links: [Document](https://dx.doi.org/10.1016/J.CSBJ.2022.07.001), ISSN 2001\-0370\.
- B\. Panganiban, B\. Qiao, T\. Jiang, C\. DelRe, M\. M\. Obadia, T\. D\. Nguyen, A\. A\.A\. Smith, A\. Hall, I\. Sit, M\. G\. Crosby, P\. B\. Dennis, E\. Drockenmuller, M\. O\. D\. L\. Cruz, and T\. Xu \(2018\) Random heteropolymers preserve protein function in foreign environments\. Science 359, pp\. 1239–1243\. External Links: [Document](https://dx.doi.org/10.1126/SCIENCE.AAO0335/SUPPL%5FFILE/AAO0335S2B.MP4), ISSN 10959203, [Link](https://www.science.org/doi/10.1126/science.aao0335)\.
- D\. J\. Rezende, S\. Mohamed, and D\. Wierstra \(2014\) Stochastic backpropagation and approximate inference in deep generative models\. arXiv\. External Links: [Document](https://dx.doi.org/10.48550/ARXIV.1401.4082), [Link](https://arxiv.org/abs/1401.4082)\.
- A\. J\. Riesselman, J\. B\. Ingraham, and D\. S\. Marks \(2018\) Deep generative models of genetic variation capture the effects of mutations\. Nature Methods 15, pp\. 816–822\. External Links: [Document](https://dx.doi.org/10.1038/s41592-018-0138-4), ISBN 4159201801384, ISSN 1548\-7105, [Link](https://www.nature.com/articles/s41592-018-0138-4)\.
- S\. Sinai, E\. Kelsic, G\. M\. Church, and M\. A\. Nowak \(2017\) Variational auto\-encoding of protein sequences\. arXiv\. External Links: [Document](https://dx.doi.org/10.48550/ARXIV.1712.03346), [Link](https://arxiv.org/abs/1712.03346)\.
- A\. A\.A\. Smith, A\. Hall, V\. Wu, and T\. Xu \(2019\) Practical prediction of heteropolymer composition and drift\. ACS Macro Letters 8, pp\. 36–40\. External Links: [Document](https://dx.doi.org/10.1021/ACSMACROLETT.8B00813/SUPPL%5FFILE/MZ8B00813%5FSI%5F001.PDF), ISSN 21611653, [Link](https://pubs.acs.org/doi/abs/10.1021/acsmacrolett.8b00813)\.
- M\. J\. Tamasi, R\. A\. Patel, C\. H\. Borca, S\. Kosuri, H\. Mugnier, R\. Upadhya, N\. S\. Murthy, M\. A\. Webb, and A\. J\. Gormley \(2022\) Machine learning on a robotic platform for the design of polymer–protein hybrids\. Advanced Materials 34, pp\. 2201809\. External Links: [Document](https://dx.doi.org/10.1002/ADMA.202201809), ISSN 1521\-4095, [Link](https://onlinelibrary.wiley.com/doi/full/10.1002/adma.202201809)\.
- The UniProt Consortium \(2020\) UniProt: the universal protein knowledgebase in 2021\. Nucleic Acids Research 49, pp\. D480–D489\. External Links: ISSN 0305\-1048, [Document](https://dx.doi.org/10.1093/nar/gkaa1100), [Link](https://doi.org/10.1093/nar/gkaa1100), https://academic\.oup\.com/nar/article\-pdf/49/D1/D480/35364103/gkaa1100\.pdf
- Y\. Zhou, B\. Gong, T\. Jiang, T\. Xu, and H\. Huang \(2022\) Stochastic variational methods in generalized hidden semi\-markov models to characterize functionality in random heteropolymers\. arXiv\. External Links: [Document](https://dx.doi.org/10.48550/ARXIV.2207.01813), [Link](https://arxiv.org/abs/2207.01813)\.