Prot2Token: A Unified Framework for Protein Modeling via Next-Token Prediction
May 26, 2025
Authors: Mahdi Pourmirzaei, Farzaneh Esmaili, Salhuldin Alqarghuli, Mohammadreza Pourmirzaei, Ye Han, Kai Chen, Mohsen Rezaei, Duolin Wang, Dong Xu
cs.AI
Abstract
The diverse nature of protein prediction tasks has traditionally necessitated
specialized models, hindering the development of broadly applicable and
computationally efficient Protein Language Models (PLMs). In this work, we
introduce Prot2Token, a unified framework that overcomes these challenges by
converting a wide spectrum of protein-related predictions, from sequence-level
properties and residue-specific attributes to complex inter-protein
interactions, into a standardized next-token prediction format. At its core,
Prot2Token employs an autoregressive decoder, conditioned on embeddings from
pre-trained protein encoders and guided by learnable task tokens, to perform
diverse predictions. This architecture uniquely facilitates multi-task
learning, enabling a single model to master numerous tasks with improved
efficiency. We present extensive experimental validation across a variety of
benchmarks, demonstrating Prot2Token's strong predictive power in different
types of protein-prediction tasks. Key results include significant speedups
(e.g., nearly 1000x over AlphaFold2 with MSA) and performance often matching or
exceeding specialized approaches. Beyond that, we introduce an auxiliary
self-supervised decoder pre-training approach to improve spatially sensitive
task performance. Prot2Token thus offers a significant step towards a
versatile, high-throughput paradigm for protein modeling, promising to
accelerate biological discovery and the development of novel therapeutics. The
code is available at https://github.com/mahdip72/prot2token.
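
To make the idea of a "standardized next-token prediction format" concrete, the following is a minimal sketch of how labels from different task types might be serialized into one token format. It is an illustration only: the function names, the `<task:...>` and `<eos>` token spellings, and the digit-level encoding of numbers are assumptions made for this example, not the vocabulary or code used in Prot2Token. Conditioning on the protein encoder's embeddings is omitted.

```python
# Hypothetical sketch: token names and label formats are assumptions,
# not Prot2Token's actual vocabulary or implementation.
EOS = "<eos>"

def encode_target(task, label):
    """Serialize a label as a token sequence that starts with a task token."""
    task_token = f"<task:{task}>"
    if isinstance(label, str):
        # Sequence-level classification: the class name is a single token.
        body = [label]
    elif isinstance(label, float):
        # Regression: a sign token followed by one token per character.
        body = list(f"{label:+.2f}")
    else:
        # Residue-level task: sorted 1-based positions of labeled residues.
        body = [str(i) for i in sorted(label)]
    return [task_token, *body, EOS]

def next_token_pairs(tokens):
    """Build (prefix, next token) pairs for an autoregressive decoder.

    The task token only ever appears in the prefix, so it acts as a prompt
    and is never a prediction target.
    """
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

if __name__ == "__main__":
    for task, label in [
        ("localization", "nucleus"),
        ("stability", 0.5),
        ("phosphorylation", {12, 3}),
    ]:
        tokens = encode_target(task, label)
        print(tokens)
        for prefix, target in next_token_pairs(tokens):
            print("   ", prefix, "->", target)
```

In this sketch a single decoder can be trained on all three tasks with the same loss, because every label, whatever its type, becomes a sequence of tokens to be predicted one at a time after the task token.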