TR2-D2: Tree Search Guided Trajectory-Aware Fine-Tuning for Discrete Diffusion
September 29, 2025
Authors: Sophia Tang, Yuchen Zhu, Molei Tao, Pranam Chatterjee
cs.AI
Abstract
Reinforcement learning with stochastic optimal control offers a promising
framework for diffusion fine-tuning, where a pre-trained diffusion model is
optimized to generate paths that lead to a reward-tilted distribution. While
these approaches enable optimization without access to explicit samples from
the optimal distribution, they require training on rollouts under the current
fine-tuned model, making them susceptible to reinforcing sub-optimal
trajectories that yield poor rewards. To overcome this challenge, we introduce
TRee Search Guided TRajectory-Aware Fine-Tuning for Discrete Diffusion
(TR2-D2), a novel framework that optimizes reward-guided discrete diffusion
trajectories with tree search to construct replay buffers for trajectory-aware
fine-tuning. These buffers are generated using Monte Carlo Tree Search (MCTS)
and subsequently used to fine-tune a pre-trained discrete diffusion model under
a stochastic optimal control objective. We validate our framework on single-
and multi-objective fine-tuning of biological sequence diffusion models,
highlighting the overall effectiveness of TR2-D2 for reliable reward-guided
fine-tuning in discrete sequence generation.
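
To make the pipeline described in the abstract concrete, below is a minimal, self-contained Python sketch of the search-and-buffer side only: MCTS over token-unmasking steps builds a replay buffer holding a high-reward denoising trajectory. Everything here (the uniform model_probs stand-in for the pre-trained denoiser, the toy reward, the UCB constant) is an illustrative assumption, not the authors' implementation, and the stochastic optimal control fine-tuning step that would consume the buffer is omitted. As background, such SOC fine-tuning methods typically target a reward-tilted distribution of the form p*(x) ∝ p_pre(x) · exp(r(x)/α), where p_pre is the pre-trained model and α a temperature; the paper's exact objective may differ.

import numpy as np

rng = np.random.default_rng(0)
VOCAB, LENGTH, N_SIM = 4, 8, 256  # toy alphabet size, sequence length, search budget
MASK = -1                         # marker for a still-masked position

def reward(seq):
    # Toy sequence-level reward: fraction of positions equal to token 0.
    return float(np.mean(seq == 0))

def model_probs(seq, pos):
    # Stand-in for the pre-trained denoiser's token distribution (uniform here).
    return np.full(VOCAB, 1.0 / VOCAB)

class Node:
    def __init__(self, seq):
        self.seq = seq            # partially unmasked sequence at this tree node
        self.children = {}        # token choice -> child Node
        self.visits, self.value = 0, 0.0

def rollout(seq):
    # Complete a partial sequence by sampling remaining positions from the model.
    seq = seq.copy()
    for pos in np.where(seq == MASK)[0]:
        seq[pos] = rng.choice(VOCAB, p=model_probs(seq, pos))
    return seq

def search(root):
    for _ in range(N_SIM):
        node, path = root, [root]
        # Selection: descend by UCB while nodes are fully expanded and non-terminal.
        while node.children and len(node.children) == VOCAB:
            ucb = {a: ch.value / (ch.visits + 1e-9)
                      + 1.4 * np.sqrt(np.log(node.visits + 1) / (ch.visits + 1e-9))
                   for a, ch in node.children.items()}
            node = node.children[max(ucb, key=ucb.get)]
            path.append(node)
        # Expansion: unmask the first masked position with an untried token.
        masked = np.where(node.seq == MASK)[0]
        if masked.size:
            a = [t for t in range(VOCAB) if t not in node.children][0]
            child_seq = node.seq.copy()
            child_seq[masked[0]] = a
            node.children[a] = Node(child_seq)
            node = node.children[a]
            path.append(node)
        # Simulation and backpropagation of the terminal reward.
        r = reward(rollout(node.seq))
        for n in path:
            n.visits += 1
            n.value += r
    return root

# Replay buffer: the most-visited denoising trajectory found by the search.
root = search(Node(np.full(LENGTH, MASK)))
buffer, node = [], root
while node.children:
    node = max(node.children.values(), key=lambda ch: ch.visits)
    buffer.append(node.seq.copy())
final = rollout(buffer[-1])
print("buffered trajectory length:", len(buffer), "| terminal reward:", reward(final))
# In TR2-D2, such buffers would then drive fine-tuning of the denoiser under a
# stochastic optimal control objective; that training step is not shown here.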