Flow-DPPO: Divergence Proximal Policy Optimization for Flow Matching Models

Abstract

Recent work has demonstrated that online reinforcement learning (RL) can substantially improve the quality and alignment of flow matching models for image and video generation. Methods such as Flow-GRPO and CPS cast the denoising process as a Markov decision process and apply PPO-style ratio clipping to enforce a trust region. However, we argue that ratio clipping is structurally ill-suited to flow models: the probability ratio between the new and old policies is a noisy, single-sample estimate of the true policy divergence, so clipping over-constrains some regions of the trajectory and under-constrains others. We propose Flow-DPPO (Flow Divergence Proximal Policy Optimization), which replaces ratio clipping with a divergence proximal constraint. A key observation is that the per-step policy in flow models is Gaussian, so the KL divergence between the old and new policies can be computed exactly and cheaply. Flow-DPPO employs an asymmetric divergence mask that blocks a gradient update only when the update both moves the policy away from the trust region and violates the divergence threshold. Experiments show that Flow-DPPO achieves higher rewards with better KL-proximal efficiency, alleviates catastrophic forgetting, promotes balanced multi-objective optimization, and enables stable multi-epoch training in settings where ratio clipping degrades. Code and models are available at https://github.com/Tencent-Hunyuan/UniRL/tree/main/FlowDPPO.
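The abstract does not give formulas, so the following is only a minimal sketch of the two ideas it names: a closed-form per-step KL and an asymmetric divergence mask. It assumes that the old and new per-step policies are isotropic Gaussians sharing the same standard deviation `sigma` (so only the means differ), and that "moving away from the trust region" means the advantage pushes the probability ratio further from 1. The function names, the threshold `delta`, and the exact mask rule are hypothetical and are not taken from the released code.

```python
import math


def gaussian_kl_shared_std(mu_old, mu_new, sigma):
    """Closed-form KL between N(mu_old, sigma^2 I) and N(mu_new, sigma^2 I).

    Assumption: both policies share the same isotropic std, so the KL
    reduces to ||mu_old - mu_new||^2 / (2 sigma^2) and is symmetric.
    """
    sq_dist = sum((a - b) ** 2 for a, b in zip(mu_old, mu_new))
    return sq_dist / (2.0 * sigma ** 2)


def gaussian_log_prob(x, mu, sigma):
    """Log-density of x under the isotropic Gaussian N(mu, sigma^2 I)."""
    d = len(x)
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, mu))
    return (-sq_dist / (2.0 * sigma ** 2)
            - d * math.log(sigma)
            - 0.5 * d * math.log(2.0 * math.pi))


def divergence_mask(ratio, advantage, kl, delta):
    """Return 1.0 to keep the gradient and 0.0 to block it.

    Hypothetical rule: block only when BOTH conditions hold:
      1. the update would push the ratio further from 1
         (advantage > 0 with ratio > 1, or advantage < 0 with ratio < 1);
      2. the exact per-step KL already exceeds the threshold delta.
    """
    moving_away = (advantage > 0 and ratio > 1.0) or (advantage < 0 and ratio < 1.0)
    return 0.0 if (moving_away and kl > delta) else 1.0


def masked_surrogate(x, mu_old, mu_new, sigma, advantage, delta):
    """Per-step surrogate mask * ratio * advantage, plus diagnostics."""
    log_ratio = gaussian_log_prob(x, mu_new, sigma) - gaussian_log_prob(x, mu_old, sigma)
    ratio = math.exp(log_ratio)
    kl = gaussian_kl_shared_std(mu_old, mu_new, sigma)
    mask = divergence_mask(ratio, advantage, kl, delta)
    return mask * ratio * advantage, ratio, kl, mask


# Toy usage: one denoising step in a 2-dimensional latent space.
step = masked_surrogate(
    x=[1.0, 0.0], mu_old=[0.0, 0.0], mu_new=[0.3, 0.0],
    sigma=1.0, advantage=1.0, delta=0.01,
)
```

In this sketch the KL depends only on the two means and the shared `sigma`, not on the sampled `x`, whereas the ratio depends on `x`; this is the contrast the abstract draws between an exact divergence and a single-sample estimate. In an autodiff framework the mask would be treated as a constant (no gradient flows through it).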