Prot2Token: A Unified Framework for Protein Modeling via Next-Token Prediction

Abstract

The diverse nature of protein prediction tasks has traditionally necessitated specialized models, hindering the development of broadly applicable and computationally efficient Protein Language Models (PLMs). In this work, we introduce Prot2Token, a unified framework that overcomes these challenges by converting a wide spectrum of protein-related predictions, from sequence-level properties and residue-specific attributes to complex inter-protein interactions, into a standardized next-token prediction format. At its core, Prot2Token employs an autoregressive decoder, conditioned on embeddings from pre-trained protein encoders and guided by learnable task tokens, to perform diverse predictions. This architecture uniquely facilitates multi-task learning, enabling a single model to master numerous tasks with improved efficiency. We present extensive experimental validation across a variety of benchmarks, demonstrating Prot2Token's strong predictive power in different types of protein-prediction tasks. Key results include significant speedups (e.g., nearly 1000x over AlphaFold2 with MSA) and performance often matching or exceeding specialized approaches. Beyond that, we introduce an auxiliary self-supervised decoder pre-training approach to improve spatially sensitive task performance. Prot2Token thus offers a significant step towards a versatile, high-throughput paradigm for protein modeling, promising to accelerate biological discovery and the development of novel therapeutics. The code is available at https://github.com/mahdip72/prot2token.
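To make the idea of a "standardized next-token prediction format" concrete, the following is a minimal sketch of how heterogeneous labels might be serialized into token sequences that share one task-token prefix convention. It is an illustration only, not the authors' tokenizer: the token spellings (`<bos>`, `<eos>`, `<task_...>`), the task names, the position-token encoding for residue sites, and the character-wise encoding of numeric targets are all assumptions, since the abstract does not specify them.

```python
from typing import List, Sequence, Tuple

# Hypothetical special tokens; the real vocabulary is not given in the abstract.
BOS, EOS = "<bos>", "<eos>"


def task_token(task: str) -> str:
    """Build the (assumed) task token that tells the decoder which task to solve."""
    return f"<task_{task}>"


def encode_classification(task: str, label: str) -> List[str]:
    """Sequence-level property: a single label token after the task token."""
    return [BOS, task_token(task), label, EOS]


def encode_residue_sites(task: str, positions: Sequence[int]) -> List[str]:
    """Residue-level attribute: sorted, de-duplicated 1-based position tokens."""
    sites = [str(p) for p in sorted(set(positions))]
    return [BOS, task_token(task), *sites, EOS]


def encode_regression(task: str, value: float, digits: int = 2) -> List[str]:
    """Scalar target: the formatted number spelled out one character per token."""
    return [BOS, task_token(task), *f"{value:.{digits}f}", EOS]


def next_token_pairs(tokens: Sequence[str]) -> List[Tuple[str, str]]:
    """Pair each token with its successor, the usual next-token training target."""
    return list(zip(tokens[:-1], tokens[1:]))


if __name__ == "__main__":
    for seq in (
        encode_classification("localization", "nucleus"),
        encode_residue_sites("phosphorylation", [45, 12, 45]),
        encode_regression("stability", 0.73),
    ):
        print(seq)
```

Because every task reduces to the same kind of sequence, one autoregressive decoder could in principle be trained on all of them with a single cross-entropy objective, with the task token selecting the output convention. In an actual model, the decoder would additionally attend to the protein encoder's embeddings, which this sketch omits.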
