TR2-D2: Tree Search Guided Trajectory-Aware Fine-Tuning for Discrete Diffusion
September 29, 2025
Authors: Sophia Tang, Yuchen Zhu, Molei Tao, Pranam Chatterjee
cs.AI
Abstract
Reinforcement learning with stochastic optimal control offers a promising
framework for diffusion fine-tuning, where a pre-trained diffusion model is
optimized to generate paths that lead to a reward-tilted distribution. While
these approaches enable optimization without access to explicit samples from
the optimal distribution, they require training on rollouts under the current
fine-tuned model, making them susceptible to reinforcing sub-optimal
trajectories that yield poor rewards. To overcome this challenge, we introduce
TRee Search Guided TRajectory-Aware Fine-Tuning for Discrete Diffusion
(TR2-D2), a novel framework that optimizes reward-guided discrete diffusion
trajectories with tree search to construct replay buffers for trajectory-aware
fine-tuning. These buffers are generated using Monte Carlo Tree Search (MCTS)
and subsequently used to fine-tune a pre-trained discrete diffusion model under
a stochastic optimal control objective. We validate our framework on single-
and multi-objective fine-tuning of biological sequence diffusion models,
highlighting the overall effectiveness of TR2-D2 for reliable reward-guided
fine-tuning in discrete sequence generation.
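
To make the pipeline described in the abstract concrete, below is a minimal, self-contained Python sketch of the search-and-buffer side only: MCTS over token-unmasking steps builds a replay buffer holding a high-reward denoising trajectory. Everything here (the uniform model_probs stand-in for the pre-trained denoiser, the toy reward, the UCB constant) is an illustrative assumption, not the authors' implementation, and the stochastic optimal control fine-tuning step that would consume the buffer is omitted. As background, such SOC fine-tuning methods typically target a reward-tilted distribution of the form p*(x) ∝ p_pre(x) · exp(r(x)/α), where p_pre is the pre-trained model and α a temperature; the paper's exact objective may differ.

import numpy as np

rng = np.random.default_rng(0)
VOCAB, LENGTH, N_SIM = 4, 8, 256  # toy alphabet size, sequence length, search budget
MASK = -1                         # marker for a still-masked position

def reward(seq):
    # Toy sequence-level reward: fraction of positions equal to token 0.
    return float(np.mean(seq == 0))

def model_probs(seq, pos):
    # Stand-in for the pre-trained denoiser's token distribution (uniform here).
    return np.full(VOCAB, 1.0 / VOCAB)

class Node:
    def __init__(self, seq):
        self.seq = seq            # partially unmasked sequence at this tree node
        self.children = {}        # token choice -> child Node
        self.visits, self.value = 0, 0.0

def rollout(seq):
    # Complete a partial sequence by sampling remaining positions from the model.
    seq = seq.copy()
    for pos in np.where(seq == MASK)[0]:
        seq[pos] = rng.choice(VOCAB, p=model_probs(seq, pos))
    return seq

def search(root):
    for _ in range(N_SIM):
        node, path = root, [root]
        # Selection: descend by UCB while nodes are fully expanded and non-terminal.
        while node.children and len(node.children) == VOCAB:
            ucb = {a: ch.value / (ch.visits + 1e-9)
                      + 1.4 * np.sqrt(np.log(node.visits + 1) / (ch.visits + 1e-9))
                   for a, ch in node.children.items()}
            node = node.children[max(ucb, key=ucb.get)]
            path.append(node)
        # Expansion: unmask the first masked position with an untried token.
        masked = np.where(node.seq == MASK)[0]
        if masked.size:
            a = [t for t in range(VOCAB) if t not in node.children][0]
            child_seq = node.seq.copy()
            child_seq[masked[0]] = a
            node.children[a] = Node(child_seq)
            node = node.children[a]
            path.append(node)
        # Simulation and backpropagation of the terminal reward.
        r = reward(rollout(node.seq))
        for n in path:
            n.visits += 1
            n.value += r
    return root

# Replay buffer: the most-visited denoising trajectory found by the search.
root = search(Node(np.full(LENGTH, MASK)))
buffer, node = [], root
while node.children:
    node = max(node.children.values(), key=lambda ch: ch.visits)
    buffer.append(node.seq.copy())
final = rollout(buffer[-1])
print("buffered trajectory length:", len(buffer), "| terminal reward:", reward(final))
# In TR2-D2, such buffers would then drive fine-tuning of the denoiser under a
# stochastic optimal control objective; that training step is not shown here.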