Gumbel-Softmax Flow Matching with Straight-Through Guidance for Controllable Biological Sequence Generation
March 21, 2025
Authors: Sophia Tang, Yinuo Zhang, Alexander Tong, Pranam Chatterjee
cs.AI
Abstract
Flow matching in the continuous simplex has emerged as a promising strategy
for DNA sequence design, but struggles to scale to higher simplex dimensions
required for peptide and protein generation. We introduce Gumbel-Softmax Flow
and Score Matching, a generative framework on the simplex based on a novel
Gumbel-Softmax interpolant with a time-dependent temperature. Using this
interpolant, we introduce Gumbel-Softmax Flow Matching by deriving a
parameterized velocity field that transports from smooth categorical
distributions to distributions concentrated at a single vertex of the simplex.
We alternatively present Gumbel-Softmax Score Matching, which learns to regress
the gradient of the probability density. Our framework enables high-quality,
diverse generation and scales efficiently to higher-dimensional simplices. To
enable training-free guidance, we propose Straight-Through Guided Flows
(STGFlow), a classifier-based guidance method that leverages straight-through
estimators to steer the unconditional velocity field toward optimal vertices of
the simplex. STGFlow enables efficient inference-time guidance using
classifiers pre-trained on clean sequences, and can be used with any discrete
flow method. Together, these components form a robust framework for
controllable de novo sequence generation. We demonstrate state-of-the-art
performance in conditional DNA promoter design, sequence-only protein
generation, and target-binding peptide design for rare disease treatment.
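
To make the idea of a Gumbel-Softmax interpolant with a time-dependent temperature more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not state the temperature schedule or the noise scaling, so the exponential decay in `temperature`, the `noise_scale` factor, and all function names and default values are assumptions chosen for illustration. The sketch only shows the qualitative behaviour the abstract describes, where samples are smooth (near-uniform) at high temperature and concentrate on a single vertex of the simplex as the temperature falls.

```python
import math
import random


def sample_gumbel(rng):
    """Draw one standard Gumbel(0, 1) sample by inverting the CDF."""
    # Clamp away from 0 so that log(u) stays finite.
    u = max(rng.random(), 1e-12)
    return -math.log(-math.log(u))


def temperature(t, tau_max=10.0, decay=7.0):
    """Assumed schedule: exponential decay from tau_max at t = 0."""
    return tau_max * math.exp(-decay * t)


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gumbel_softmax_interpolant(token, vocab_size, t, rng, noise_scale=0.1):
    """Return a noisy point on the simplex for a clean token at time t in [0, 1].

    The one-hot encoding of `token` is perturbed with Gumbel noise and
    divided by the temperature before the softmax, so a small temperature
    sharpens the result toward one vertex of the simplex.
    """
    tau = temperature(t)
    logits = [
        ((1.0 if k == token else 0.0) + noise_scale * sample_gumbel(rng)) / tau
        for k in range(vocab_size)
    ]
    return softmax(logits)


if __name__ == "__main__":
    rng = random.Random(0)
    # 20 categories, e.g. the amino-acid alphabet; token index 3 is arbitrary.
    for t in (0.0, 0.5, 1.0):
        x_t = gumbel_softmax_interpolant(3, 20, t, rng)
        print(f"t={t:.1f}  tau={temperature(t):.4f}  max prob={max(x_t):.4f}")
```

Under these assumed settings the largest probability grows from roughly uniform at t = 0 toward 1 at t = 1. A straight-through guidance step would additionally need an automatic differentiation library to pass classifier gradients through the discretized sample, and is not sketched here.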