BioMatrix: Towards a Comprehensive Biological Foundation Model Spanning the Modality Matrix of Sequence, Structure, and Language

Abstract

We present BioMatrix, the first multimodal foundation model that natively integrates sequences, structures, and natural language for both molecules and proteins within a single decoder-only architecture. Existing biological foundation models pursue native multimodality and broad entity coverage separately. Those that fuse multiple modalities under a shared objective remain confined to a single entity type, while those spanning multiple entity types either omit explicit structural modeling or rely on adapter-based designs in which the model cannot natively generate the modalities it can read. BioMatrix closes this gap by mapping molecular sequences (in both SMILES and SELFIES notations), molecular structures, protein sequences, protein structures, and natural language into a shared discrete token space through a unified tokenization scheme. All modalities are therefore consumed and produced uniformly under a single next-token prediction objective, without external encoders, projection adapters, or modality-specific output heads. Built upon the Qwen3 language model (1.7B and 4B), BioMatrix is continually pretrained on 304.4 billion tokens spanning general and domain-specific text, sequence and structure views of molecules and proteins, and cross-modal corpora that interleave biomolecular entities with scientific text and link distinct entities through molecule-protein and protein-protein interaction data. After tuning on a comprehensive suite of downstream applications covering 80 tasks across 6 categories (single-entity and multi-entity understanding and generation tasks, both within and across modalities), BioMatrix achieves state-of-the-art or competitive performance on 77 of the 80 tasks, demonstrating that a single, natively multimodal generalist model can match or surpass specialized approaches across a wide range of biological tasks.
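
The idea of a shared discrete token space can be illustrated with a toy sketch. The abstract does not specify the tokenizer, so everything below is an assumption made for illustration only: the `SharedVocab` class, the boundary tags such as `<mol_seq>`, the per-character splitting of SMILES and protein strings, and the integer structure codes are all hypothetical and are not BioMatrix's actual scheme. The sketch only shows how several modalities can be flattened into one ID sequence that a decoder-only model could train on with a single next-token prediction objective.

```python
from dataclasses import dataclass, field


@dataclass
class SharedVocab:
    """Toy vocabulary: every modality draws IDs from one shared ID space."""
    token_to_id: dict = field(default_factory=dict)
    id_to_token: list = field(default_factory=list)

    def add(self, token: str) -> int:
        # Assign the next free ID the first time a token is seen.
        if token not in self.token_to_id:
            self.token_to_id[token] = len(self.id_to_token)
            self.id_to_token.append(token)
        return self.token_to_id[token]

    def encode(self, tokens):
        return [self.add(t) for t in tokens]

    def decode(self, ids):
        return [self.id_to_token[i] for i in ids]


def tag(modality: str, symbols):
    """Wrap symbols in hypothetical boundary tags and namespace them by modality,
    so that e.g. carbon 'C' in a molecule never collides with cysteine 'C'."""
    body = [f"{modality}:{s}" for s in symbols]
    return [f"<{modality}>"] + body + [f"</{modality}>"]


def build_sequence(text_words, smiles, protein, structure_codes):
    """Interleave text, a molecule, a protein, and structure codes in one stream.
    Per-character splitting is a simplification (real SMILES has multi-character
    atoms such as 'Cl'); structure codes stand in for some discrete quantizer."""
    tokens = list(text_words)
    tokens += tag("mol_seq", smiles)
    tokens += tag("prot_seq", protein)
    tokens += tag("prot_struct", [str(c) for c in structure_codes])
    return tokens


vocab = SharedVocab()
tokens = build_sequence(["Ethanol", "binds"], "CCO", "MKT", [17, 4, 17])
ids = vocab.encode(tokens)

# Next-token prediction: each position is trained to predict the following ID,
# regardless of which modality that ID belongs to.
inputs, targets = ids[:-1], ids[1:]
```

Because every modality ends up as plain IDs in one vocabulary, reading and generating a modality use the same mechanism, which is the property the abstract contrasts with adapter-based designs.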