DPLM-2: A Multimodal Diffusion Protein Language Model

Abstract

Proteins are essential macromolecules defined by their amino acid sequences, which determine their three-dimensional structures and, consequently, their functions in all living organisms. Generative protein modeling therefore calls for a multimodal approach that can simultaneously model, understand, and generate both sequences and structures. However, existing methods typically use separate models for each modality, which limits their ability to capture the intricate relationships between sequence and structure. This results in suboptimal performance on tasks that require joint understanding and generation of both modalities. In this paper, we introduce DPLM-2, a multimodal protein foundation model that extends the discrete diffusion protein language model (DPLM) to accommodate both sequences and structures. To enable structural learning with the language model, 3D coordinates are converted to discrete tokens using a lookup-free quantization-based tokenizer. By training on both experimental and high-quality synthetic structures, DPLM-2 learns the joint distribution of sequence and structure, as well as their marginals and conditionals. We also implement an efficient warm-up strategy to exploit the connection between large-scale evolutionary data and structural inductive biases from pre-trained sequence-based protein language models. Empirical evaluation shows that DPLM-2 can simultaneously generate highly compatible amino acid sequences and their corresponding 3D structures, eliminating the need for a two-stage generation approach. Moreover, DPLM-2 demonstrates competitive performance in various conditional generation tasks, including folding, inverse folding, and scaffolding with multimodal motif inputs, and provides structure-aware representations for predictive tasks.
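
The quantization step mentioned above can be illustrated with a minimal sketch of lookup-free quantization in general. This is not DPLM-2's actual tokenizer: the structure encoder that would map 3D coordinates to per-residue latents is omitted, and the function names (`lfq_tokenize`, `lfq_codes`), the latent dimension, and the example values are hypothetical. The idea shown is that each latent dimension is reduced to its sign, and the resulting bit pattern is read directly as an integer token id, so no learned codebook lookup is needed.

```python
import numpy as np


def lfq_tokenize(latents: np.ndarray) -> np.ndarray:
    """Map (L, D) real-valued latents to (L,) integer token ids.

    Each dimension contributes one bit (1 if positive, else 0);
    bit i carries weight 2**i, giving a vocabulary of 2**D tokens.
    """
    bits = (latents > 0).astype(np.int64)
    weights = 2 ** np.arange(latents.shape[-1], dtype=np.int64)
    return bits @ weights


def lfq_codes(token_ids: np.ndarray, dim: int) -> np.ndarray:
    """Recover the (L, dim) quantized vectors in {-1, +1} from token ids."""
    bits = (token_ids[:, None] >> np.arange(dim, dtype=np.int64)) & 1
    return 2.0 * bits - 1.0


# Hypothetical per-residue latents for a two-residue protein, D = 3.
latents = np.array([[0.3, -1.2, 0.8],
                    [-0.5, -0.1, 2.0]])
tokens = lfq_tokenize(latents)   # one discrete structure token per residue
codes = lfq_codes(tokens, dim=3)  # the sign vectors those tokens stand for
```

With D latent dimensions the vocabulary has 2^D entries, so the token alphabet can be made large without storing or searching an explicit codebook; the discrete tokens can then be fed to a language model alongside amino acid tokens.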
