BioMatrix: Towards a Comprehensive Biological Foundation Model Spanning the Modality Matrix of Sequences, Structures, and Language

Hugging Face Daily Papers 06/20/26, 12:00 AM Papers

multimodal foundation-model biology sequences structures language tokenization

Summary

BioMatrix is a multimodal foundation model that unifies molecular and protein sequences, structures, and natural language in a single decoder-only architecture, achieving state-of-the-art or competitive performance on 77 out of 80 biological tasks.

We present BioMatrix, the first multimodal foundation model that natively integrates sequences, structures, and natural language for both molecules and proteins within a single decoder-only architecture. Existing biological foundation models pursue native multimodality and broad entity coverage separately: those that fuse multiple modalities under a shared objective remain confined to a single entity type, while those spanning multiple entity types either omit explicit structural modeling or rely on adapter-based designs in which the model cannot natively generate the very modalities it can read. BioMatrix closes this gap by mapping molecular sequences (supporting both SMILES and SELFIES notations), molecular structures, protein sequences, protein structures, and natural language into a shared discrete token space through a unified tokenization scheme, so that all modalities are consumed and produced uniformly under a single next-token prediction objective -- without external encoders, projection adapters, or modality-specific output heads. Built upon the Qwen3 language model (1.7B and 4B), BioMatrix is continually pretrained on 304.4 billion tokens spanning general and domain-specific text, sequence and structure views of molecules and proteins, and cross-modal corpora that interleave biomolecular entities with scientific text and link distinct entities through molecule-protein and protein-protein interaction data. After tuning on a comprehensive suite of downstream applications covering 80 tasks across 6 categories -- encompassing single-entity and multi-entity understanding and generation tasks across and within modalities -- BioMatrix achieves state-of-the-art or competitive performance on 77 out of 80 tasks, demonstrating that a single, natively multimodal generalist model can effectively match or surpass specialized approaches across a wide range of biological tasks.
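The idea of a shared discrete token space trained with one next-token objective can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the boundary tags (`<smiles>`, `<protein>`), the `mol:` and `aa:` prefixes, the regex-based SMILES splitting, and the helper names are hypothetical and are not taken from the paper, whose actual tokenization scheme is not specified in the abstract. Structure tokens are omitted entirely.

```python
import re

# Hypothetical boundary tags; the paper's real special tokens are not given here.
MODALITY_TAGS = {
    "smiles": ("<smiles>", "</smiles>"),
    "protein": ("<protein>", "</protein>"),
}

# Simplified SMILES splitting (an assumption): bracket atoms, two-letter
# halogens, then any single remaining character.
SMILES_PATTERN = re.compile(r"\[[^\]]+\]|Br|Cl|.")


def tokenize_smiles(smiles):
    """Split a SMILES string into prefixed tokens."""
    return ["mol:" + tok for tok in SMILES_PATTERN.findall(smiles)]


def tokenize_protein(sequence):
    """One prefixed token per amino-acid letter."""
    return ["aa:" + residue for residue in sequence]


def tokenize_text(text):
    """Whitespace tokenization stands in for a real subword tokenizer."""
    return text.lower().split()


TOKENIZERS = {"smiles": tokenize_smiles, "protein": tokenize_protein}


def build_sequence(segments):
    """Flatten interleaved (modality, content) segments into one token list."""
    tokens = []
    for modality, content in segments:
        if modality == "text":
            tokens.extend(tokenize_text(content))
        else:
            open_tag, close_tag = MODALITY_TAGS[modality]
            tokens.append(open_tag)
            tokens.extend(TOKENIZERS[modality](content))
            tokens.append(close_tag)
    return tokens


class SharedVocab:
    """A single id space shared by every modality."""

    def __init__(self):
        self.token_to_id = {}
        self.id_to_token = []

    def encode(self, tokens):
        ids = []
        for tok in tokens:
            if tok not in self.token_to_id:
                self.token_to_id[tok] = len(self.id_to_token)
                self.id_to_token.append(tok)
            ids.append(self.token_to_id[tok])
        return ids

    def decode(self, ids):
        return [self.id_to_token[i] for i in ids]


def next_token_targets(ids):
    """Inputs and targets shifted by one position, as in next-token prediction."""
    return ids[:-1], ids[1:]


segments = [
    ("text", "The molecule"),
    ("smiles", "CC(=O)Cl"),
    ("text", "binds"),
    ("protein", "MKT"),
]
vocab = SharedVocab()
tokens = build_sequence(segments)
ids = vocab.encode(tokens)
inputs, targets = next_token_targets(ids)
```

Because text, molecule, and protein tokens share one id space, the same shifted-by-one targets cover reading and generating any modality, which is the property the abstract attributes to the single next-token prediction objective. A real system would replace the whitespace and regex tokenizers with learned or curated vocabularies.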

Original Article

Cached at: 06/23/26, 09:41 AM

Source: https://huggingface.co/papers/2606.22138

Abstract

BioMatrix is a novel multimodal foundation model that integrates molecular sequences, structures, and natural language into a unified decoder-only architecture for diverse biological tasks.

Models citing this paper: 4

#### QizhiPei/BioMatrix-4B-Base

Text Generation • 4B • Updated about 1 hour ago • 77 • 1

#### QizhiPei/BioMatrix-4B-SFT

Text Generation • 4B • Updated about 1 hour ago • 69 • 1

#### QizhiPei/BioMatrix-1.7B-Base

Text Generation • 2B • Updated about 1 hour ago • 76

#### QizhiPei/BioMatrix-1.7B-SFT

Text Generation • 2B • Updated about 1 hour ago • 5

Datasets citing this paper: 1

#### QizhiPei/BioMatrix-SFT

Viewer • Updated about 1 hour ago • 23.6M • 787 • 1
