DPLM-2: A Multimodal Diffusion Protein Language Model
October 17, 2024
Authors: Xinyou Wang, Zaixiang Zheng, Fei Ye, Dongyu Xue, Shujian Huang, Quanquan Gu
cs.AI
Abstract
Proteins are essential macromolecules defined by their amino acid sequences,
which determine their three-dimensional structures and, consequently, their
functions in all living organisms. Therefore, generative protein modeling
necessitates a multimodal approach to simultaneously model, understand, and
generate both sequences and structures. However, existing methods typically use
separate models for each modality, limiting their ability to capture the
intricate relationships between sequence and structure. This results in
suboptimal performance in tasks that require joint understanding and
generation of both modalities. In this paper, we introduce DPLM-2, a multimodal
protein foundation model that extends the discrete diffusion protein language
(DPLM) to accommodate both sequences and structures. To enable structural
learning with the language model, 3D coordinates are converted to discrete
tokens using a lookup-free quantization-based tokenizer. By training on both
experimental and high-quality synthetic structures, DPLM-2 learns the joint
distribution of sequence and structure, as well as their marginals and
conditionals. We also implement an efficient warm-up strategy to exploit the
connection between large-scale evolutionary data and structural inductive
biases from pre-trained sequence-based protein language models. Empirical
evaluation shows that DPLM-2 can simultaneously generate highly compatible
amino acid sequences and their corresponding 3D structures, eliminating the need
for a two-stage generation approach. Moreover, DPLM-2 demonstrates competitive
performance in various conditional generation tasks, including folding, inverse
folding, and scaffolding with multimodal motif inputs, as well as providing
structure-aware representations for predictive tasks.
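The abstract's key mechanism for making structure consumable by a language model is a lookup-free quantization (LFQ) tokenizer: instead of a nearest-neighbor search against a learned codebook, each dimension of a structure latent is binarized by its sign and the resulting bit pattern is read directly as a token index. The sketch below illustrates that core idea only; the encoder that produces the latent from 3D coordinates, the `num_bits` value, and the function name are illustrative assumptions, not DPLM-2's actual implementation.

```python
import numpy as np

def lfq_tokenize(latent: np.ndarray, num_bits: int = 8) -> int:
    """Minimal lookup-free quantization sketch (illustrative, not DPLM-2's
    actual tokenizer): quantize each latent dimension to {0, 1} by its sign,
    then read the bit pattern as an integer token index. There is no codebook
    lookup; the vocabulary is implicitly all 2**num_bits bit patterns."""
    assert latent.shape[-1] == num_bits
    bits = (latent > 0).astype(np.int64)   # sign of each dimension -> bit
    weights = 2 ** np.arange(num_bits)     # binary place values 1, 2, 4, ...
    return int(bits @ weights)             # token id in [0, 2**num_bits)

# In a full pipeline, a structure encoder (not shown) would map each
# residue's local 3D geometry to such a latent; the resulting token ids
# form the structure vocabulary the language model is trained on.
latent = np.array([0.3, -1.2, 0.7, 0.1, -0.5, 2.0, -0.1, 0.4])
token_id = lfq_tokenize(latent)  # -> 173 (depends only on the signs)
```

Because the token index is a deterministic function of the signs, decoding a token back to a coarse latent is equally trivial (map each bit to ±1), which is what makes the scheme "lookup-free".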