DiffusionOPD: A Unified Perspective on On-Policy Distillation in Diffusion Models

Abstract

Reinforcement learning (RL) has emerged as a powerful tool for improving diffusion-based text-to-image models, but existing methods are largely limited to single-task optimization. Extending RL to multiple tasks is challenging: joint optimization suffers from cross-task interference and imbalance, while cascade RL is cumbersome and prone to catastrophic forgetting. We propose DiffusionOPD, a new multi-task training paradigm for diffusion models based on On-Policy Distillation (OPD). DiffusionOPD first trains task-specific teachers independently, then distills their capabilities into a unified student along the student's own rollout trajectories. This decouples single-task exploration from multi-task integration and avoids the optimization burden of solving all tasks jointly from scratch. Theoretically, we lift the OPD framework from discrete tokens to continuous-state Markov processes, deriving a closed-form per-step KL objective that unifies stochastic SDE and deterministic ODE refinement via mean-matching. We show formally and empirically that this analytic gradient has lower variance and better generality than conventional PPO-style policy gradients. Extensive experiments show that DiffusionOPD consistently surpasses both multi-reward RL and cascade RL baselines in training efficiency and final performance, while achieving state-of-the-art results on all evaluated benchmarks.
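
The abstract describes a closed-form per-step KL objective that amounts to mean-matching. A standard identity gives some intuition for why a KL between transition kernels can collapse to a distance between means: for two Gaussians sharing the same isotropic variance, the KL divergence equals the squared distance between their means divided by twice the variance. The sketch below checks that identity numerically. It is only an illustration under assumptions not stated in the abstract: that the student and teacher denoising steps are Gaussian with a common per-step standard deviation `sigma`. The function names are hypothetical, and the paper's actual objective (including its treatment of the deterministic ODE case) may differ.

```python
import math


def kl_shared_sigma(mu_student, mu_teacher, sigma):
    """KL(N(mu_student, sigma^2 I) || N(mu_teacher, sigma^2 I)).

    With a shared isotropic variance the log-variance and trace terms
    cancel, leaving only a scaled squared distance between the means.
    Assumption: both kernels use the same sigma (hypothetical setup).
    """
    sq_dist = sum((s - t) ** 2 for s, t in zip(mu_student, mu_teacher))
    return sq_dist / (2.0 * sigma ** 2)


def kl_diag_gaussians(mu_p, var_p, mu_q, var_q):
    """General KL(p || q) for diagonal Gaussians, used as a cross-check."""
    total = 0.0
    for mp, vp, mq, vq in zip(mu_p, var_p, mu_q, var_q):
        total += math.log(vq / vp) + (vp + (mp - mq) ** 2) / vq - 1.0
    return 0.5 * total


if __name__ == "__main__":
    mu_s, mu_t, sigma = [1.0, 2.0], [0.0, 0.0], 0.5
    shared = kl_shared_sigma(mu_s, mu_t, sigma)
    general = kl_diag_gaussians(mu_s, [sigma ** 2] * 2, mu_t, [sigma ** 2] * 2)
    print(shared, general)  # the two values agree
```

Because the simplified form depends on the means only, its gradient with respect to the student mean is available analytically. This is consistent with the abstract's claim of an analytic gradient, though the abstract itself does not spell out the derivation.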