BioMatrix: Towards a Comprehensive Biological Foundation Model Spanning the Modality Matrix of Sequences, Structures, and Language
Summary
BioMatrix is a multimodal foundation model that unifies molecular sequences, structures, and natural language in a single decoder-only architecture, achieving state-of-the-art performance on 77 out of 80 biological tasks.
Cached at: 06/23/26, 09:41 AM
Source: https://huggingface.co/papers/2606.22138
Abstract
BioMatrix is a novel multimodal foundation model that integrates molecular sequences, structures, and natural language into a unified decoder-only architecture for diverse biological tasks.
We present BioMatrix, the first multimodal foundation model that natively integrates sequences, structures, and natural language for both molecules and proteins within a single decoder-only architecture. Existing biological foundation models pursue native multimodality and broad entity coverage separately: those that fuse multiple modalities under a shared objective remain confined to a single entity type, while those spanning multiple entity types either omit explicit structural modeling or rely on adapter-based designs in which the model cannot natively generate the very modalities it can read. BioMatrix closes this gap by mapping molecular sequences (supporting both SMILES and SELFIES notations), molecular structures, protein sequences, protein structures, and natural language into a shared discrete token space through a unified tokenization scheme, so that all modalities are consumed and produced uniformly under a single next-token prediction objective -- without external encoders, projection adapters, or modality-specific output heads. Built upon the Qwen3 language model (1.7B and 4B), BioMatrix is continually pretrained on 304.4 billion tokens spanning general and domain-specific text, sequence and structure views of molecules and proteins, and cross-modal corpora that interleave biomolecular entities with scientific text and link distinct entities through molecule-protein and protein-protein interaction data. After tuning on a comprehensive suite of downstream applications covering 80 tasks across 6 categories -- encompassing single-entity and multi-entity understanding and generation tasks across and within modalities -- BioMatrix achieves state-of-the-art or competitive performance on 77 out of 80 tasks, demonstrating that a single, natively multimodal generalist model can effectively match or surpass specialized approaches across a wide range of biological tasks.
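The idea of a shared discrete token space trained with next-token prediction can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not the paper's tokenizer: the `SharedVocab` class, the modality tags (`<mol>`, `<prot>`, `<struct>`), the per-character SMILES split, and the integer structure codes are all hypothetical. The sketch only shows how text, a molecule, a protein, and structure codes could live in one id table so that a single decoder sees one interleaved stream.

```python
from typing import Dict, List, Tuple


class SharedVocab:
    """Hypothetical single token-to-id table shared by every modality."""

    def __init__(self) -> None:
        self.token_to_id: Dict[str, int] = {}

    def encode(self, tokens: List[str]) -> List[int]:
        ids = []
        for tok in tokens:
            # Unseen tokens get the next free id, regardless of modality.
            if tok not in self.token_to_id:
                self.token_to_id[tok] = len(self.token_to_id)
            ids.append(self.token_to_id[tok])
        return ids


def text_tokens(text: str) -> List[str]:
    # Whitespace split stands in for a real subword tokenizer.
    return text.split()


def smiles_tokens(smiles: str) -> List[str]:
    # Naive per-character split; real SMILES has multi-character atoms (Cl, Br).
    return ["<mol>"] + [f"<m>{ch}" for ch in smiles] + ["</mol>"]


def protein_tokens(sequence: str) -> List[str]:
    # One token per amino-acid letter.
    return ["<prot>"] + [f"<p>{aa}" for aa in sequence] + ["</prot>"]


def structure_tokens(codes: List[int]) -> List[str]:
    # Assumes structure was already quantized into discrete codebook indices.
    return ["<struct>"] + [f"<s_{c}>" for c in codes] + ["</struct>"]


def next_token_pairs(ids: List[int]) -> List[Tuple[int, int]]:
    # Each position is trained to predict the id that follows it.
    return list(zip(ids[:-1], ids[1:]))


vocab = SharedVocab()
stream = (
    text_tokens("The ligand")
    + smiles_tokens("CCO")
    + text_tokens("binds")
    + protein_tokens("MKT")
    + structure_tokens([17, 4, 17])
)
ids = vocab.encode(stream)
pairs = next_token_pairs(ids)
```

Because every modality resolves to ids in the same table, the `(input, target)` pairs cross modality boundaries freely, which is what lets one decoder both read and generate each modality without a separate encoder or output head.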
Models citing this paper: 4
#### QizhiPei/BioMatrix-4B-Base Text Generation • 4B • Updated about 1 hour ago • 77 • 1
#### QizhiPei/BioMatrix-4B-SFT Text Generation • 4B • Updated about 1 hour ago • 69 • 1
#### QizhiPei/BioMatrix-1.7B-Base Text Generation • 2B • Updated about 1 hour ago • 76
#### QizhiPei/BioMatrix-1.7B-SFT Text Generation • 2B • Updated about 1 hour ago • 5
Datasets citing this paper: 1
#### QizhiPei/BioMatrix-SFT Viewer • Updated about 1 hour ago • 23.6M • 787 • 1
Similar Articles
Brain Score Tracks Shared Properties of Languages: Evidence from Many Natural Languages and Structured Sequences
This paper investigates whether Brain Score, a metric comparing language model representations to human fMRI activations during reading, is truly capturing human-like language processing or merely structural similarity. The researchers train language models on diverse natural languages and non-linguistic structured data (genome, Python, nested parentheses), finding that models trained on different languages and even non-linguistic sequences achieve similar Brain Score performance, suggesting the metric may not be sensitive enough to distinguish human-specific processing.
BioTool: A Comprehensive Tool-Calling Dataset for Enhancing Biomedical Capabilities of Large Language Models
BioTool introduces a comprehensive biomedical tool-calling dataset with 34 tools and 7,040 human-verified query-API pairs, enabling fine-tuned LLMs to outperform GPT-5.1 on biomedical tool use and significantly enhance answer quality.
Beyond Prompt-Based Planning: MCP-Native Graph Planning-based Biomedical Agent System
BioManus is an MCP-native biomedical agent system that uses graph-scaffolded planning over structured biological capabilities instead of flat prompt-based tool retrieval, achieving better context efficiency and execution accuracy on biomedical benchmarks. The system introduces a BioinfoMCP Compiler to standardize heterogeneous bioinformatics tools and organizes them as a typed heterogeneous MCP graph for scalable reasoning.
Bayesian Model Merging
Introduces Bayesian Model Merging (BMM), a plug-and-play bi-level optimization framework for combining multiple task-specific experts into a single model, achieving state-of-the-art performance on vision and language benchmarks.
Benchmarking Biology’s AI Agent: ML@B's Collaboration with LatchBio
Machine Learning at Berkeley collaborated with LatchBio to benchmark their AI agent's performance on spatial transcriptomics workflows, evaluating its ability to automate complex bioinformatics tasks.