
Flow-DPPO: Divergence Proximal Policy Optimization for Flow Matching Models

June 9, 2026
Authors: Bowen Ping, Xiangxin Zhou, Penghui Qi, Minnan Luo, Liefeng Bo, Tianyu Pang
cs.AI

Abstract

Recent work has demonstrated that online reinforcement learning (RL) can substantially improve the quality and alignment of flow matching models for image and video generation. Methods such as Flow-GRPO and CPS cast the denoising process as a Markov Decision Process and apply PPO-style ratio clipping to enforce a trust region. However, we argue that ratio clipping is structurally ill-suited for flow models: the probability ratio between new and old policies is a noisy, single-sample estimate of the true policy divergence, leading to over-constraining in some regions of the trajectory and under-constraining in others. We propose Flow-DPPO (Flow Divergence Proximal Policy Optimization), which replaces ratio clipping with a divergence proximal constraint. A key observation is that the per-step policy in flow models is Gaussian, enabling exact and cheap computation of the KL divergence between old and new policies. Flow-DPPO employs an asymmetric divergence mask that blocks gradient updates only when they simultaneously move away from the trusted region and violate the divergence threshold. Experiments show that Flow-DPPO achieves higher rewards with better KL-proximal efficiency, alleviates catastrophic forgetting, promotes balanced multi-objective optimization, and enables stable multi-epoch training where ratio clipping degrades. Code and models are available at https://github.com/Tencent-Hunyuan/UniRL/tree/main/FlowDPPO.
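
The two mechanisms named in the abstract, a closed-form Gaussian KL and an asymmetric divergence mask, can be illustrated with a minimal sketch. It is not taken from the authors' repository. It assumes that the old and new per-step policies are isotropic Gaussians sharing the same standard deviation `sigma`, so the KL reduces to a scaled squared distance between the means, and that "moving away from the trusted region" means the probability ratio moves in the direction the advantage rewards. The names `gaussian_kl`, `keep_gradient`, and `delta` are hypothetical.

```python
def gaussian_kl(mu_old, mu_new, sigma):
    """KL between N(mu_old, sigma^2 I) and N(mu_new, sigma^2 I).

    Assumption: both policies share the same isotropic std `sigma`,
    so the KL is ||mu_old - mu_new||^2 / (2 * sigma^2).
    """
    sq_dist = sum((a - b) ** 2 for a, b in zip(mu_old, mu_new))
    return sq_dist / (2.0 * sigma ** 2)


def keep_gradient(ratio, advantage, kl, delta):
    """Return False only when the update should be blocked.

    Assumed reading of the asymmetric mask: block the gradient when the
    update pushes the ratio further from 1 in the direction favored by
    the advantage AND the exact KL already exceeds the threshold `delta`.
    """
    moving_away = (advantage > 0 and ratio > 1.0) or (advantage < 0 and ratio < 1.0)
    return not (moving_away and kl > delta)


# Hypothetical per-step values for one denoising step.
kl = gaussian_kl([0.0, 0.0], [3.0, 4.0], sigma=5.0)
print(kl, keep_gradient(ratio=1.2, advantage=1.0, kl=kl, delta=0.1))
```

Under this reading, a sample whose ratio has drifted back toward 1, or whose KL is still below `delta`, keeps its gradient. Because the decision uses the exact KL rather than the single-sample ratio, a noisy ratio alone cannot trigger the mask.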