Protein Autoregressive Modeling via Multiscale Structure Generation
February 4, 2026
Authors: Yanru Qu, Cheng-Yen Hsieh, Zaixiang Zheng, Ge Liu, Quanquan Gu
cs.AI
Abstract
We present protein autoregressive modeling (PAR), the first multi-scale autoregressive framework for protein backbone generation via coarse-to-fine next-scale prediction. Exploiting the hierarchical nature of proteins, PAR generates structures in a way that mimics sculpting a statue: it first forms a coarse topology and then refines structural details across scales. To achieve this, PAR consists of three key components: (i) multi-scale downsampling operations that represent protein structures across multiple scales during training; (ii) an autoregressive transformer that encodes multi-scale information and produces conditional embeddings to guide structure generation; and (iii) a flow-based backbone decoder that generates backbone atoms conditioned on these embeddings. Autoregressive models also suffer from exposure bias, caused by the mismatch between the training and generation procedures, which substantially degrades structure generation quality. We alleviate this issue by adopting noisy context learning and scheduled sampling, enabling robust backbone generation. Notably, PAR exhibits strong zero-shot generalization, supporting flexible human-prompted conditional generation and motif scaffolding without fine-tuning. On the unconditional generation benchmark, PAR effectively learns protein distributions, produces backbones of high design quality, and exhibits favorable scaling behavior. Together, these properties establish PAR as a promising framework for protein structure generation.
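As an illustration of component (i) and of the noisy-context idea, the following is a minimal sketch, not the authors' implementation. The abstract does not specify the downsampling operator, so average pooling over contiguous residue segments is an assumption here, as are the scale sizes, the Gaussian perturbation, and the hypothetical helper names `downsample`, `multiscale`, and `add_noise`. The backbone is reduced to one 3D point per residue for simplicity.

```python
import random


def downsample(coords, size):
    """Average-pool a list of (x, y, z) points into `size` contiguous segments.

    Assumption: pooling along the sequence stands in for the paper's
    multi-scale downsampling operation, which the abstract does not define.
    """
    n = len(coords)
    if not 1 <= size <= n:
        raise ValueError("size must be between 1 and len(coords)")
    pooled = []
    for i in range(size):
        start = i * n // size
        end = (i + 1) * n // size  # always > start because size <= n
        segment = coords[start:end]
        pooled.append(
            tuple(sum(p[d] for p in segment) / len(segment) for d in range(3))
        )
    return pooled


def multiscale(coords, sizes):
    """Return one pooled copy of the structure per scale, coarse to fine."""
    return [downsample(coords, s) for s in sizes]


def add_noise(coords, sigma, rng):
    """Perturb every coordinate with Gaussian noise of standard deviation sigma.

    Assumption: this is one simple reading of 'noisy context learning', where
    the coarse context seen during training is corrupted so that the model
    tolerates its own imperfect predictions at generation time.
    """
    return [tuple(c + rng.gauss(0.0, sigma) for c in p) for p in coords]


# Toy backbone: 8 residues placed on a straight line (hypothetical data).
backbone = [(float(i), 0.0, 0.0) for i in range(8)]
scales = multiscale(backbone, [1, 2, 4, 8])  # hypothetical scale sizes
rng = random.Random(0)
noisy_context = [add_noise(s, 0.1, rng) for s in scales[:-1]]

for clean in scales:
    print(len(clean), clean[0])
```

In a next-scale prediction setup, each finer scale would be predicted conditioned on the (possibly noised) coarser ones; the transformer and the flow-based decoder are outside the scope of this sketch.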