BioMatrix: Towards a Comprehensive Biological Foundation Model Spanning the Modality Matrix of Sequences, Structures, and Language
June 20, 2026
Authors: Qizhi Pei, Zhimeng Zhou, Yi Duan, Yiyang Zhao, Wei Li, Han Guo, Liang He, Chengping Li, Chang-Yu Hsieh, Conghui He, Rui Yan, Lijun Wu
cs.AI
Abstract
We present BioMatrix, the first multimodal foundation model that natively integrates sequences, structures, and natural language for both molecules and proteins within a single decoder-only architecture. Existing biological foundation models pursue native multimodality and broad entity coverage separately: those that fuse multiple modalities under a shared objective remain confined to a single entity type, while those spanning multiple entity types either omit explicit structural modeling or rely on adapter-based designs in which the model cannot natively generate the very modalities it can read. BioMatrix closes this gap by mapping molecular sequences (supporting both SMILES and SELFIES notations), molecular structures, protein sequences, protein structures, and natural language into a shared discrete token space through a unified tokenization scheme, so that all modalities are consumed and produced uniformly under a single next-token prediction objective, without external encoders, projection adapters, or modality-specific output heads. Built upon the Qwen3 language model (1.7B and 4B), BioMatrix is continually pretrained on 304.4 billion tokens spanning general and domain-specific text, sequence and structure views of molecules and proteins, and cross-modal corpora that interleave biomolecular entities with scientific text and link distinct entities through molecule-protein and protein-protein interaction data. After tuning on a comprehensive suite of downstream applications covering 80 tasks across 6 categories (encompassing single-entity and multi-entity understanding and generation tasks across and within modalities), BioMatrix achieves state-of-the-art or competitive performance on 77 out of 80 tasks, demonstrating that a single, natively multimodal generalist model can effectively match or surpass specialized approaches across a wide range of biological tasks.
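The idea of a shared discrete token space trained with a single next-token objective can be illustrated with a toy sketch. Everything in it is an assumption made for illustration only: the modality names, the boundary tags such as `<mol_seq>`, the tiny symbol inventories, and the helper functions `build_vocab`, `encode`, and `next_token_pairs` are hypothetical and are not taken from BioMatrix, whose actual tokenization scheme is not specified in the abstract.

```python
from typing import Dict, List, Tuple

# Hypothetical symbol inventories per modality (illustrative, not from the paper).
MODALITY_SYMBOLS: Dict[str, List[str]] = {
    "text": ["binds", "to", "the", "ligand", "protein"],
    "mol_seq": list("CNO=()1"),                   # a few SMILES characters
    "prot_seq": list("ACDEFGHIKLMNPQRSTVWY"),     # 20 standard amino acids
    "prot_struct": [f"s{i}" for i in range(4)],   # stand-in discrete structure codes
}


def build_vocab(symbols: Dict[str, List[str]]) -> Dict[str, int]:
    """Assign every (modality, symbol) pair a unique id in one shared vocabulary."""
    vocab: Dict[str, int] = {}
    for modality, items in symbols.items():
        # Boundary tags tell the model which modality a span belongs to.
        for tag in (f"<{modality}>", f"</{modality}>"):
            vocab[tag] = len(vocab)
        # Prefixing with the modality keeps SMILES "C" distinct from residue "C".
        for item in items:
            vocab[f"{modality}:{item}"] = len(vocab)
    return vocab


def encode(segments: List[Tuple[str, List[str]]], vocab: Dict[str, int]) -> List[int]:
    """Flatten interleaved modality segments into a single id sequence."""
    ids: List[int] = []
    for modality, items in segments:
        ids.append(vocab[f"<{modality}>"])
        ids.extend(vocab[f"{modality}:{item}"] for item in items)
        ids.append(vocab[f"</{modality}>"])
    return ids


def next_token_pairs(ids: List[int]) -> Tuple[List[int], List[int]]:
    """Shift by one position: the target at step t is the input at step t + 1."""
    return ids[:-1], ids[1:]


VOCAB = build_vocab(MODALITY_SYMBOLS)

# One interleaved example: a molecule, a short text span, and a protein fragment.
SEGMENTS = [
    ("mol_seq", list("CC(=O)N")),
    ("text", ["binds", "to", "the", "protein"]),
    ("prot_seq", list("MKV")),
]
IDS = encode(SEGMENTS, VOCAB)
INPUTS, TARGETS = next_token_pairs(IDS)
```

Because every modality shares one id space, the same decoder and the same loss apply to each position of `INPUTS` and `TARGETS`, whichever modality the token came from. That is the property the abstract contrasts with designs that depend on external encoders, projection adapters, or modality-specific output heads.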