SimpleFold: Folding Proteins is Simpler than You Think

Abstract

Protein folding models have typically achieved groundbreaking results by integrating domain knowledge into their architectural blocks and training pipelines. Nonetheless, given the success of generative models on different but related problems, it is natural to ask whether these architectural designs are necessary for building performant models. In this paper, we introduce SimpleFold, the first flow-matching-based protein folding model that uses only general-purpose transformer blocks. Protein folding models typically employ computationally expensive modules involving triangular updates, explicit pair representations, or multiple training objectives curated for this specific domain. Instead, SimpleFold employs standard transformer blocks with adaptive layers and is trained via a generative flow-matching objective with an additional structural term. We scale SimpleFold to 3B parameters and train it on approximately 9M distilled protein structures together with experimental PDB data. On standard folding benchmarks, SimpleFold-3B achieves competitive performance compared to state-of-the-art baselines. In addition, SimpleFold demonstrates strong performance in ensemble prediction, which is typically difficult for models trained via deterministic reconstruction objectives. Due to its general-purpose architecture, SimpleFold is efficient to deploy and run for inference on consumer-level hardware. SimpleFold challenges the reliance on complex domain-specific architectural designs in protein folding, opening up an alternative design space for future progress.
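To make the phrase "generative flow-matching objective" concrete, the following is a minimal sketch of a generic flow-matching regression loss and an Euler sampler in plain Python. It assumes a straight-line path between Gaussian noise and the target coordinates, which is one common choice and not necessarily the one SimpleFold uses. The abstract does not specify the path, the time sampling, or the additional structural term, so the structural term is left out, and every function name here (including the make_oracle stand-in for a trained network) is hypothetical.

```python
import random


def interpolate(x0, x1, t):
    """Point on the straight-line path from noise x0 (t=0) to data x1 (t=1)."""
    return [[(1 - t) * a + t * b for a, b in zip(p0, p1)]
            for p0, p1 in zip(x0, x1)]


def target_velocity(x0, x1):
    """Velocity of the straight-line path: d/dt x_t = x1 - x0."""
    return [[b - a for a, b in zip(p0, p1)] for p0, p1 in zip(x0, x1)]


def flow_matching_loss(predict_velocity, x1, rng):
    """One Monte Carlo sample of the flow-matching regression loss.

    predict_velocity(x_t, t) plays the role of the network (assumed interface).
    x1 is a list of 3D coordinates; noise x0 is drawn from a unit Gaussian.
    """
    x0 = [[rng.gauss(0.0, 1.0) for _ in p] for p in x1]
    t = rng.random()  # uniform in [0, 1); an assumption, not from the paper
    x_t = interpolate(x0, x1, t)
    v_pred = predict_velocity(x_t, t)
    v_true = target_velocity(x0, x1)
    sq = [(p - q) ** 2
          for vp, vt in zip(v_pred, v_true) for p, q in zip(vp, vt)]
    return sum(sq) / len(sq)


def euler_sample(predict_velocity, x0, steps=10):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = [list(p) for p in x0]
    dt = 1.0 / steps
    for i in range(steps):
        v = predict_velocity(x, i * dt)
        x = [[a + dt * b for a, b in zip(p, q)] for p, q in zip(x, v)]
    return x


def make_oracle(x1):
    """Hypothetical stand-in for a trained model: exact velocity toward x1."""
    def predict_velocity(x_t, t):
        # On the straight-line path, (x1 - x_t) / (1 - t) equals x1 - x0.
        return [[(b - a) / (1.0 - t) for a, b in zip(p, q)]
                for p, q in zip(x_t, x1)]
    return predict_velocity


if __name__ == "__main__":
    rng = random.Random(0)
    coords = [[0.0, 1.0, 2.0], [3.0, 4.0, 5.0]]  # toy "structure", two atoms
    oracle = make_oracle(coords)
    print("loss with oracle velocity:", flow_matching_loss(oracle, coords, rng))
    noise = [[rng.gauss(0.0, 1.0) for _ in p] for p in coords]
    print("sampled coordinates:", euler_sample(oracle, noise))
```

Because each sample starts from fresh noise, repeated sampling from a model trained this way can produce different structures. This is the property the abstract contrasts with deterministic reconstruction objectives when it discusses ensemble prediction.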
