Prot2Token: A Unified Framework for Protein Modeling via Next-Token Prediction

Abstract

The diverse nature of protein prediction tasks has traditionally necessitated specialized models, hindering the development of broadly applicable and computationally efficient Protein Language Models (PLMs). In this work, we introduce Prot2Token, a unified framework that overcomes these challenges by converting a wide spectrum of protein-related predictions, from sequence-level properties and residue-specific attributes to complex inter-protein interactions, into a standardized next-token prediction format. At its core, Prot2Token employs an autoregressive decoder, conditioned on embeddings from pre-trained protein encoders and guided by learnable task tokens, to perform diverse predictions. This architecture uniquely facilitates multi-task learning, enabling a single model to master numerous tasks with improved efficiency. We present extensive experimental validation across a variety of benchmarks, demonstrating Prot2Token's strong predictive power across different types of protein prediction tasks. Key results include significant speedups (e.g., nearly 1000x over AlphaFold2 with MSA) and performance that often matches or exceeds specialized approaches. In addition, we introduce an auxiliary self-supervised decoder pre-training approach to improve performance on spatially sensitive tasks. Prot2Token thus offers a significant step towards a versatile, high-throughput paradigm for protein modeling, promising to accelerate biological discovery and the development of novel therapeutics. The code is available at https://github.com/mahdip72/prot2token.
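To make the idea of a "standardized next-token prediction format" concrete, the following is a minimal sketch of how labels from different task types could be serialized into target token sequences that begin with a task token. It is an illustration only: the token strings, the task names (`localization`, `phosphorylation`, `stability`), and the three helper functions are assumptions made for this example and are not taken from the Prot2Token code base.

```python
from typing import List, Sequence

# Hypothetical special tokens; the real vocabulary may differ.
BOS, EOS = "<bos>", "<eos>"


def task_token(task: str) -> str:
    """Build the (hypothetical) task token that tells the decoder what to predict."""
    return f"<task_{task}>"


def encode_classification(task: str, label: str) -> List[str]:
    """Sequence-level property: the target is a single label token."""
    return [BOS, task_token(task), label, EOS]


def encode_residue_sites(task: str, positions: Sequence[int]) -> List[str]:
    """Residue-specific attribute: the target is the sorted list of residue positions."""
    return [BOS, task_token(task), *(str(p) for p in sorted(positions)), EOS]


def encode_regression(task: str, value: float, digits: int = 2) -> List[str]:
    """Numeric property: the target is the value spelled out character by character."""
    text = f"{value:.{digits}f}"
    return [BOS, task_token(task), *text, EOS]


if __name__ == "__main__":
    print(encode_classification("localization", "nucleus"))
    print(encode_residue_sites("phosphorylation", [45, 12]))
    print(encode_regression("stability", 0.5))
```

Under this assumed scheme, every task yields the same kind of object, a token list prefixed by a task token, so a single autoregressive decoder could in principle be trained on all of them with one next-token objective.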
