SimpleFold: Folding Proteins is Simpler than You Think
September 23, 2025
Authors: Yuyang Wang, Jiarui Lu, Navdeep Jaitly, Josh Susskind, Miguel Angel Bautista
cs.AI
Abstract
Protein folding models have typically achieved groundbreaking results by
integrating domain knowledge into their architectural blocks and training
pipelines. Nonetheless, given the success of generative models across
different but related problems, it is natural to question whether these
architectural designs are necessary for building performant models. In
this paper, we introduce SimpleFold, the first flow-matching-based protein
folding model that uses only general-purpose transformer blocks. Protein
folding models typically employ computationally expensive modules involving
triangular updates, explicit pair representations, or multiple training
objectives curated for this specific domain. Instead, SimpleFold employs
standard transformer blocks with adaptive layers and is trained via a
generative flow-matching objective with an additional structural term. We scale
SimpleFold to 3B parameters and train it on approximately 9M distilled protein
structures together with experimental PDB data. On standard folding benchmarks,
SimpleFold-3B achieves competitive performance compared to state-of-the-art
baselines. In addition, SimpleFold demonstrates strong performance in ensemble
prediction, which is typically difficult for models trained via deterministic
reconstruction objectives. Due to its general-purpose architecture, SimpleFold
shows efficiency in deployment and inference on consumer-level hardware.
SimpleFold challenges the reliance on complex domain-specific architecture
designs in protein folding, opening up an alternative design space for future
progress.
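
For readers unfamiliar with the term, a generative flow-matching objective can be illustrated with a minimal sketch. The abstract does not specify the interpolation path, the noise distribution, or the additional structural term, so everything below is an assumption chosen for illustration: a straight-line path between Gaussian noise and the target coordinates, a single uniformly sampled time, and a hypothetical `velocity_fn` standing in for the model. It is not the authors' implementation, and the structural term is omitted entirely.

```python
import random


def flow_matching_loss(velocity_fn, x1, rng):
    """One-sample estimate of a flow-matching loss on a straight-line path.

    x1          : target coordinates, a list of [x, y, z] lists (one per atom).
    velocity_fn : hypothetical model; maps (x_t, t) to a predicted velocity
                  with the same nested-list shape as x_t.
    rng         : a random.Random instance, passed in for reproducibility.
    """
    # Assumption: time is drawn uniformly from [0, 1).
    t = rng.random()
    # Assumption: the source distribution is standard Gaussian noise.
    x0 = [[rng.gauss(0.0, 1.0) for _ in atom] for atom in x1]
    # Point on the straight line from noise (t = 0) to data (t = 1).
    x_t = [[(1.0 - t) * a + t * b for a, b in zip(noise, data)]
           for noise, data in zip(x0, x1)]
    # Along a straight line the velocity is constant: data minus noise.
    target = [[b - a for a, b in zip(noise, data)]
              for noise, data in zip(x0, x1)]
    pred = velocity_fn(x_t, t)
    # Mean squared error between predicted and target velocities.
    sq_errors = [(p - q) ** 2
                 for pred_atom, target_atom in zip(pred, target)
                 for p, q in zip(pred_atom, target_atom)]
    return sum(sq_errors) / len(sq_errors)
```

In practice the loss would be averaged over many structures and time samples and minimized with respect to the model parameters; the pure-Python lists here only keep the sketch dependency-free. At inference, samples would be produced by integrating the learned velocity field from noise toward data, and different noise draws give different samples.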