Gumbel-Softmax Flow Matching with Straight-Through Guidance for Controllable Biological Sequence Generation
March 21, 2025
Authors: Sophia Tang, Yinuo Zhang, Alexander Tong, Pranam Chatterjee
cs.AI
Abstract
Flow matching in the continuous simplex has emerged as a promising strategy
for DNA sequence design, but struggles to scale to the higher simplex dimensions
required for peptide and protein generation. We introduce Gumbel-Softmax Flow
and Score Matching, a generative framework on the simplex based on a novel
Gumbel-Softmax interpolant with a time-dependent temperature. Using this
interpolant, we introduce Gumbel-Softmax Flow Matching by deriving a
parameterized velocity field that transports from smooth categorical
distributions to distributions concentrated at a single vertex of the simplex.
We alternatively present Gumbel-Softmax Score Matching, which learns to regress
the gradient of the probability density. Our framework enables high-quality,
diverse generation and scales efficiently to higher-dimensional simplices. To
enable training-free guidance, we propose Straight-Through Guided Flows
(STGFlow), a classifier-based guidance method that leverages straight-through
estimators to steer the unconditional velocity field toward optimal vertices of
the simplex. STGFlow enables efficient inference-time guidance using
classifiers pre-trained on clean sequences, and can be used with any discrete
flow method. Together, these components form a robust framework for
controllable de novo sequence generation. We demonstrate state-of-the-art
performance in conditional DNA promoter design, sequence-only protein
generation, and target-binding peptide design for rare disease treatment.
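The abstract describes a Gumbel-Softmax interpolant whose time-dependent temperature moves samples from a smooth categorical distribution toward a single vertex of the simplex. The following is a minimal sketch of that idea, not the authors' implementation. The exponential schedule and the names `tau_max`, `decay`, and `noise_scale` are assumptions introduced here for illustration; the abstract does not specify them.

```python
import math
import random


def gumbel_softmax_interpolant(target_index, num_classes, t,
                               tau_max=10.0, decay=3.0,
                               noise_scale=1.0, rng=None):
    """Return a point on the probability simplex for time t in [0, 1].

    Assumed (hypothetical) temperature schedule: tau(t) = tau_max * exp(-decay * t),
    so the sample is smooth at t = 0 and sharpens as t approaches 1.
    """
    rng = rng or random.Random()
    tau = tau_max * math.exp(-decay * t)

    logits = []
    for k in range(num_classes):
        # Standard Gumbel(0, 1) noise via the inverse CDF; clamp u away from 0 and 1.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))
        one_hot = 1.0 if k == target_index else 0.0
        logits.append((one_hot + noise_scale * gumbel) / tau)

    # Numerically stable softmax.
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Usage: with the noise switched off, the mass on the target token grows with t.
early = gumbel_softmax_interpolant(2, 4, t=0.0, noise_scale=0.0)
late = gumbel_softmax_interpolant(2, 4, t=1.0, noise_scale=0.0)
```

With `noise_scale=0.0` the sketch is deterministic, which makes the effect of the temperature schedule easy to inspect: `late` places more probability on index 2 than `early` does. With noise enabled, each call draws a different point on the simplex.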