OpenProteinSet: Training data for structural biology at scale

Abstract

Multiple sequence alignments (MSAs) of proteins encode rich biological information and have been workhorses in bioinformatic methods for tasks like protein design and protein structure prediction for decades. Recent breakthroughs like AlphaFold2 that use transformers to attend directly over large quantities of raw MSAs have reaffirmed their importance. Generation of MSAs is highly computationally intensive, however, and no datasets comparable to those used to train AlphaFold2 have been made available to the research community, hindering progress in machine learning for proteins. To remedy this problem, we introduce OpenProteinSet, an open-source corpus of more than 16 million MSAs, associated structural homologs from the Protein Data Bank, and AlphaFold2 protein structure predictions. We have previously demonstrated the utility of OpenProteinSet by successfully retraining AlphaFold2 on it. We expect OpenProteinSet to be broadly useful as training and validation data for 1) diverse tasks focused on protein structure, function, and design and 2) large-scale multimodal machine learning research.
