OpenProteinSet: Training data for structural biology at scale

Abstract

Multiple sequence alignments (MSAs) of proteins encode rich biological information and have for decades been workhorses of bioinformatic methods for tasks such as protein design and protein structure prediction. Recent breakthroughs like AlphaFold2, which use transformers to attend directly over large quantities of raw MSAs, have reaffirmed their importance. Generating MSAs is highly computationally intensive, however, and no datasets comparable to those used to train AlphaFold2 have been made available to the research community, which hinders progress in machine learning for proteins. To remedy this problem, we introduce OpenProteinSet, an open-source corpus of more than 16 million MSAs, associated structural homologs from the Protein Data Bank, and AlphaFold2 protein structure predictions. We have previously demonstrated the utility of OpenProteinSet by successfully retraining AlphaFold2 on it. We expect OpenProteinSet to be broadly useful as training and validation data for 1) diverse tasks focused on protein structure, function, and design and 2) large-scale multimodal machine learning research.
