Protein Autoregressive Modeling via Multiscale Structure Generation
February 4, 2026
Authors: Yanru Qu, Cheng-Yen Hsieh, Zaixiang Zheng, Ge Liu, Quanquan Gu
cs.AI
Abstract
We present protein autoregressive modeling (PAR), the first multi-scale autoregressive framework for protein backbone generation via coarse-to-fine next-scale prediction. Exploiting the hierarchical nature of proteins, PAR generates structures in a way that mimics sculpting a statue: it first forms a coarse topology and then refines structural details over successive scales. To achieve this, PAR consists of three key components: (i) multi-scale downsampling operations that represent protein structures across multiple scales during training; (ii) an autoregressive transformer that encodes multi-scale information and produces conditional embeddings to guide structure generation; and (iii) a flow-based backbone decoder that generates backbone atoms conditioned on these embeddings. Autoregressive models also suffer from exposure bias, which is caused by the mismatch between the training and generation procedures and substantially degrades structure generation quality. We effectively alleviate this issue by adopting noisy context learning and scheduled sampling, enabling robust backbone generation. Notably, PAR exhibits strong zero-shot generalization, supporting flexible human-prompted conditional generation and motif scaffolding without fine-tuning. On the unconditional generation benchmark, PAR effectively learns protein distributions, produces backbones of high design quality, and exhibits favorable scaling behavior. Together, these properties establish PAR as a promising framework for protein structure generation.
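The multi-scale representation in component (i) can be illustrated with a minimal sketch. The abstract does not specify the downsampling operator, so the code below assumes simple average pooling of backbone coordinates over contiguous residue segments. The function names (`downsample`, `multiscale_pyramid`, `noisy_context`), the scale lengths, and the Gaussian form of the noise are all hypothetical choices made for illustration, not the authors' implementation.

```python
import random

def downsample(coords, target_len):
    """Average-pool a list of (x, y, z) points into target_len contiguous segments.

    Assumption: average pooling stands in for the paper's downsampling operation.
    """
    n = len(coords)
    if not 1 <= target_len <= n:
        raise ValueError("target_len must be between 1 and len(coords)")
    pooled = []
    for i in range(target_len):
        start = i * n // target_len
        end = (i + 1) * n // target_len
        segment = coords[start:end]
        pooled.append(tuple(sum(p[k] for p in segment) / len(segment) for k in range(3)))
    return pooled

def multiscale_pyramid(coords, scale_lengths):
    """Return coarse-to-fine representations, one per requested scale length."""
    return [downsample(coords, m) for m in scale_lengths]

def noisy_context(scale, sigma, rng):
    """Perturb a context scale with Gaussian noise.

    Assumption: a hypothetical stand-in for noisy context learning, where the
    model is trained on perturbed coarse scales rather than clean ones.
    """
    return [tuple(c + rng.gauss(0.0, sigma) for c in p) for p in scale]

# Toy backbone: 8 points on a straight line along the x axis.
coords = [(float(i), 0.0, 0.0) for i in range(8)]
pyramid = multiscale_pyramid(coords, [2, 4, 8])
perturbed = noisy_context(pyramid[0], 0.1, random.Random(0))
```

In this toy setup the coarsest scale summarizes the chain with two points and the finest scale reproduces the full chain, which mirrors the coarse-to-fine ordering an autoregressive model would predict over. Perturbing the coarse scales during training is one plausible way to expose the model to imperfect context of the kind it sees at generation time.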