DPLM-2: A Multimodal Diffusion Protein Language Model
October 17, 2024
Authors: Xinyou Wang, Zaixiang Zheng, Fei Ye, Dongyu Xue, Shujian Huang, Quanquan Gu
cs.AI
Abstract
Proteins are essential macromolecules defined by their amino acid sequences,
which determine their three-dimensional structures and, consequently, their
functions in all living organisms. Therefore, generative protein modeling
necessitates a multimodal approach to simultaneously model, understand, and
generate both sequences and structures. However, existing methods typically use
separate models for each modality, limiting their ability to capture the
intricate relationships between sequence and structure. This results in
suboptimal performance in tasks that require joint understanding and
generation of both modalities. In this paper, we introduce DPLM-2, a multimodal
protein foundation model that extends the discrete diffusion protein language
(DPLM) to accommodate both sequences and structures. To enable structural
learning with the language model, 3D coordinates are converted to discrete
tokens using a lookup-free quantization-based tokenizer. By training on both
experimental and high-quality synthetic structures, DPLM-2 learns the joint
distribution of sequence and structure, as well as their marginals and
conditionals. We also implement an efficient warm-up strategy to exploit the
connection between large-scale evolutionary data and structural inductive
biases from pre-trained sequence-based protein language models. Empirical
evaluation shows that DPLM-2 can simultaneously generate highly compatible
amino acid sequences and their corresponding 3D structures, eliminating the need
for a two-stage generation approach. Moreover, DPLM-2 demonstrates competitive
performance in various conditional generation tasks, including folding, inverse
folding, and scaffolding with multimodal motif inputs, as well as providing
structure-aware representations for predictive tasks.
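The abstract's key mechanism for making structure consumable by a language model is a lookup-free quantization (LFQ) tokenizer: instead of a nearest-neighbor search against a learned codebook, each dimension of a structure latent is binarized by its sign and the resulting bit pattern is read directly as a token index. The sketch below illustrates that core idea only; the encoder that produces the latent from 3D coordinates, the `num_bits` value, and the function name are illustrative assumptions, not DPLM-2's actual implementation.

```python
import numpy as np

def lfq_tokenize(latent: np.ndarray, num_bits: int = 8) -> int:
    """Minimal lookup-free quantization sketch (illustrative, not DPLM-2's
    actual tokenizer): quantize each latent dimension to {0, 1} by its sign,
    then read the bit pattern as an integer token index. There is no codebook
    lookup; the vocabulary is implicitly all 2**num_bits bit patterns."""
    assert latent.shape[-1] == num_bits
    bits = (latent > 0).astype(np.int64)   # sign of each dimension -> bit
    weights = 2 ** np.arange(num_bits)     # binary place values 1, 2, 4, ...
    return int(bits @ weights)             # token id in [0, 2**num_bits)

# In a full pipeline, a structure encoder (not shown) would map each
# residue's local 3D geometry to such a latent; the resulting token ids
# form the structure vocabulary the language model is trained on.
latent = np.array([0.3, -1.2, 0.7, 0.1, -0.5, 2.0, -0.1, 0.4])
token_id = lfq_tokenize(latent)  # -> 173 (depends only on the signs)
```

Because the token index is a deterministic function of the signs, decoding a token back to a coarse latent is equally trivial (map each bit to ±1), which is what makes the scheme "lookup-free".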