OpenProteinSet: Training data for structural biology at scale

Abstract

Multiple sequence alignments (MSAs) of proteins encode rich biological information and have been workhorses in bioinformatic methods for tasks like protein design and protein structure prediction for decades. Recent breakthroughs like AlphaFold2 that use transformers to attend directly over large quantities of raw MSAs have reaffirmed their importance. Generation of MSAs is highly computationally intensive, however, and no datasets comparable to those used to train AlphaFold2 have been made available to the research community, hindering progress in machine learning for proteins. To remedy this problem, we introduce OpenProteinSet, an open-source corpus of more than 16 million MSAs, associated structural homologs from the Protein Data Bank, and AlphaFold2 protein structure predictions. We have previously demonstrated the utility of OpenProteinSet by successfully retraining AlphaFold2 on it. We expect OpenProteinSet to be broadly useful as training and validation data for 1) diverse tasks focused on protein structure, function, and design and 2) large-scale multimodal machine learning research.
