BioMatrix: Towards a Comprehensive Biological Foundation Model Spanning the Modality Matrix of Sequence, Structure, and Language

Abstract

We present BioMatrix, the first multimodal foundation model that natively integrates sequences, structures, and natural language for both molecules and proteins within a single decoder-only architecture. Existing biological foundation models pursue native multimodality and broad entity coverage separately: those that fuse multiple modalities under a shared objective remain confined to a single entity type, while those that span multiple entity types either omit explicit structural modeling or rely on adapter-based designs in which the model cannot natively generate the very modalities it can read. BioMatrix closes this gap by mapping molecular sequences (in both SMILES and SELFIES notations), molecular structures, protein sequences, protein structures, and natural language into a shared discrete token space through a unified tokenization scheme. All modalities are therefore consumed and produced uniformly under a single next-token prediction objective, without external encoders, projection adapters, or modality-specific output heads. Built upon the Qwen3 language models (1.7B and 4B), BioMatrix is continually pretrained on 304.4 billion tokens spanning general and domain-specific text, sequence and structure views of molecules and proteins, and cross-modal corpora that interleave biomolecular entities with scientific text and link distinct entities through molecule-protein and protein-protein interaction data. After tuning on a comprehensive suite of 80 downstream tasks across 6 categories, covering single-entity and multi-entity understanding and generation both across and within modalities, BioMatrix achieves state-of-the-art or competitive performance on 77 of the 80 tasks. This demonstrates that a single, natively multimodal generalist model can match or surpass specialized approaches across a wide range of biological tasks.
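
To make the idea of a shared discrete token space concrete, the following is a minimal illustrative sketch, not the BioMatrix tokenizer itself. It assumes toy per-modality vocabularies, hypothetical boundary markers such as `<mol_seq>`, and made-up structure codes such as `<ps_0>`; none of these names come from the abstract. The sketch shows only how tokens from several modalities could share one flat ID space, so that an interleaved sequence can be trained with a single next-token objective.

```python
from typing import Dict, List, Tuple

# Hypothetical modality names and toy vocabularies (assumptions for
# illustration only; the real tokenization scheme is not described here).
MODALITY_VOCABS: Dict[str, List[str]] = {
    "text": ["binds", "to", "the", "ligand"],
    "mol_seq": ["C", "O", "N", "=", "(", ")"],       # SMILES-like symbols
    "mol_struct": [f"<ms_{i}>" for i in range(4)],    # made-up structure codes
    "prot_seq": list("ACDEFGHIKLMNPQRSTVWY"),         # 20 standard amino acids
    "prot_struct": [f"<ps_{i}>" for i in range(4)],   # made-up structure codes
}

Key = Tuple[str, str]


def build_shared_vocab(vocabs: Dict[str, List[str]]) -> Dict[Key, int]:
    """Assign every (modality, token) pair a unique ID in one flat ID space."""
    token_to_id: Dict[Key, int] = {}
    for modality, tokens in vocabs.items():
        # Boundary markers signal where a modality segment starts and ends.
        for marker in (f"<{modality}>", f"</{modality}>"):
            token_to_id[("special", marker)] = len(token_to_id)
        for tok in tokens:
            token_to_id[(modality, tok)] = len(token_to_id)
    return token_to_id


def encode(segments: List[Tuple[str, List[str]]],
           token_to_id: Dict[Key, int]) -> List[int]:
    """Flatten interleaved modality segments into one ID sequence."""
    ids: List[int] = []
    for modality, tokens in segments:
        ids.append(token_to_id[("special", f"<{modality}>")])
        ids.extend(token_to_id[(modality, t)] for t in tokens)
        ids.append(token_to_id[("special", f"</{modality}>")])
    return ids


def next_token_pairs(ids: List[int]) -> List[Tuple[List[int], int]]:
    """Build (context, target) pairs for a single next-token objective."""
    return [(ids[:i], ids[i]) for i in range(1, len(ids))]


if __name__ == "__main__":
    vocab = build_shared_vocab(MODALITY_VOCABS)
    example = [
        ("text", ["the", "ligand"]),
        ("mol_seq", ["C", "C", "O"]),
        ("text", ["binds", "to"]),
        ("prot_seq", list("MKV")),
    ]
    ids = encode(example, vocab)
    print(len(vocab), ids)
    print(len(next_token_pairs(ids)))
```

In this sketch the symbol `C` receives different IDs as a molecular-sequence token and as an amino-acid token, because the vocabulary is keyed by modality. That is one possible design choice for avoiding collisions between modalities; whether BioMatrix handles overlapping symbols this way is not stated in the abstract.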