Prot2Token: A Unified Framework for Protein Modeling via Next-Token Prediction

Abstract

The diversity of protein prediction tasks has traditionally required specialized models, hindering the development of broadly applicable and computationally efficient Protein Language Models (PLMs). In this work, we introduce Prot2Token, a unified framework that overcomes these challenges by converting a wide spectrum of protein-related predictions, from sequence-level properties and residue-specific attributes to complex inter-protein interactions, into a standardized next-token prediction format. At its core, Prot2Token employs an autoregressive decoder, conditioned on embeddings from pre-trained protein encoders and guided by learnable task tokens, to perform diverse predictions. This architecture facilitates multi-task learning, enabling a single model to master numerous tasks with improved efficiency. We present extensive experimental validation across a variety of benchmarks, demonstrating Prot2Token's strong predictive power across different types of protein prediction tasks. Key results include significant speedups (e.g., nearly 1000x over AlphaFold2 with MSA) and performance that often matches or exceeds specialized approaches. In addition, we introduce an auxiliary self-supervised decoder pre-training approach to improve performance on spatially sensitive tasks. Prot2Token thus offers a significant step towards a versatile, high-throughput paradigm for protein modeling, promising to accelerate biological discovery and the development of novel therapeutics. The code is available at https://github.com/mahdip72/prot2token.
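
To make the "standardized next-token prediction format" concrete, the sketch below shows one way heterogeneous labels could be serialized into token sequences that begin with a task token. It is an illustration only: the token spellings (`<task_...>`, `<eos>`), the function names, and the choice to encode residue-level labels as 1-based positions are assumptions made here, not the scheme used in the Prot2Token code base.

```python
from typing import List, Sequence, Tuple


def encode_classification(task: str, label: str) -> List[str]:
    """Serialize a sequence-level label as [task token, label, end token].

    The token spellings are hypothetical, chosen only for illustration.
    """
    return [f"<task_{task}>", label, "<eos>"]


def encode_positive_residues(task: str, per_residue: Sequence[int]) -> List[str]:
    """Serialize per-residue 0/1 labels as the 1-based indices of positives.

    Listing positions instead of one label per residue is an assumption
    made here to keep target sequences short.
    """
    positions = [str(i + 1) for i, flag in enumerate(per_residue) if flag]
    return [f"<task_{task}>", *positions, "<eos>"]


def next_token_pairs(tokens: Sequence[str]) -> List[Tuple[List[str], str]]:
    """Turn a target sequence into (prefix, next token) training pairs.

    The task token is always part of the prefix, so it steers what the
    decoder is asked to produce.
    """
    return [(list(tokens[:i]), tokens[i]) for i in range(1, len(tokens))]


if __name__ == "__main__":
    target = encode_positive_residues("binding_site", [0, 1, 0, 1, 1])
    for prefix, nxt in next_token_pairs(target):
        print(prefix, "->", nxt)
```

In this framing, a classification task and a residue-level task differ only in the task token and the label vocabulary, which is what lets a single autoregressive decoder be trained on both.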
