

Multimodal Language Modeling for High-Accuracy Single Cell Transcriptomics Analysis and Generation

March 12, 2025
Authors: Yaorui Shi, Jiaqi Yang, Sihang Li, Junfeng Fang, Xiang Wang, Zhiyuan Liu, Yang Zhang
cs.AI

Abstract

Pre-trained language models (PLMs) have revolutionized scientific research, yet their application to single-cell analysis remains limited. Text PLMs cannot process single-cell RNA sequencing data, while cell PLMs lack the ability to handle free text, restricting their use in multimodal tasks. Existing efforts to bridge these modalities often suffer from information loss or inadequate single-modal pre-training, leading to suboptimal performance. To address these challenges, we propose the Single-Cell MultiModal Generative Pre-trained Transformer (scMMGPT), a unified PLM for joint cell and text modeling. scMMGPT effectively integrates state-of-the-art cell and text PLMs, facilitating cross-modal knowledge sharing for improved performance. To bridge the text-cell modality gap, scMMGPT leverages dedicated cross-modal projectors and undergoes extensive pre-training on 27 million cells, the largest dataset for multimodal cell-text PLMs to date. This large-scale pre-training enables scMMGPT to excel in joint cell-text tasks, achieving an 84% relative improvement in textual discrepancy for cell description generation, 20.5% higher accuracy for cell type annotation, and a 4% improvement in k-NN accuracy for text-conditioned pseudo-cell generation, outperforming baselines.
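The abstract's key architectural claim is that dedicated cross-modal projectors map between a cell PLM's embedding space and a text PLM's embedding space. Below is a minimal, hypothetical PyTorch sketch of that general idea; the class name `CellToTextProjector`, the dimensions, and the two-layer MLP design are illustrative assumptions, not the authors' actual implementation.

```python
# Hypothetical sketch of a cell-to-text cross-modal projector: embeddings
# from a cell PLM are mapped into the hidden space of a text PLM so the
# language model can attend over both modalities. All names and sizes
# here are assumptions for illustration.
import torch
import torch.nn as nn

class CellToTextProjector(nn.Module):
    """Projects cell-PLM embeddings into a text PLM's hidden space."""
    def __init__(self, cell_dim: int = 512, text_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(cell_dim, text_dim),
            nn.GELU(),
            nn.Linear(text_dim, text_dim),
        )

    def forward(self, cell_emb: torch.Tensor) -> torch.Tensor:
        # cell_emb: (batch, n_cells, cell_dim) -> (batch, n_cells, text_dim)
        return self.proj(cell_emb)

# Usage: projected cell embeddings act as "soft tokens" that can be
# prepended to the text PLM's token embeddings.
cell_emb = torch.randn(2, 8, 512)      # embeddings from a cell PLM
projector = CellToTextProjector()
soft_tokens = projector(cell_emb)      # shape: (2, 8, 4096)
```

A symmetric text-to-cell projector would be needed for the text-conditioned pseudo-cell generation task the abstract mentions, mapping language-model states back into the cell PLM's space.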
