DiffusionOPD: A Unified Perspective on On-Policy Distillation in Diffusion Models

Abstract

Reinforcement learning (RL) has emerged as a powerful tool for improving diffusion-based text-to-image models, but existing methods are largely limited to single-task optimization. Extending RL to multiple tasks is challenging: joint optimization suffers from cross-task interference and imbalance, while cascade RL is cumbersome and prone to catastrophic forgetting. We propose DiffusionOPD, a new multi-task training paradigm for diffusion models based on Online Policy Distillation (OPD). DiffusionOPD first trains task-specific teachers independently, then distills their capabilities into a unified student along the student's own rollout trajectories. This decouples single-task exploration from multi-task integration and avoids the optimization burden of solving all tasks jointly from scratch. Theoretically, we lift the OPD framework from discrete tokens to continuous-state Markov processes, deriving a closed-form per-step KL objective that unifies stochastic SDE and deterministic ODE refinement via mean-matching. We show, both formally and empirically, that this analytic gradient provides lower variance and better generality than conventional PPO-style policy gradients. Extensive experiments show that DiffusionOPD consistently surpasses both multi-reward RL and cascade RL baselines in training efficiency and final performance, and achieves state-of-the-art results on all evaluated benchmarks.
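The abstract does not spell out the per-step objective, so the following is only a minimal sketch of why a per-step KL can reduce to mean-matching. It assumes (our assumption, not stated in the abstract) that the student and teacher transitions at a given step are isotropic Gaussians with a shared standard deviation sigma, in which case KL(N(mu_s, sigma^2 I) || N(mu_t, sigma^2 I)) = ||mu_s - mu_t||^2 / (2 sigma^2). The function names and example values are hypothetical and chosen for illustration; this is not the paper's implementation.

```python
import numpy as np


def gaussian_kl_shared_sigma(mu_student, mu_teacher, sigma):
    """Closed-form KL(N(mu_student, sigma^2 I) || N(mu_teacher, sigma^2 I)).

    Assumption: both transitions are isotropic Gaussians with the same sigma,
    so the KL depends only on the squared distance between the two means.
    """
    diff = np.asarray(mu_student, dtype=float) - np.asarray(mu_teacher, dtype=float)
    return float(np.sum(diff ** 2) / (2.0 * sigma ** 2))


def monte_carlo_kl(mu_student, mu_teacher, sigma, n_samples=200_000, seed=0):
    """Sample-based estimate of the same KL, used only to check the closed form."""
    rng = np.random.default_rng(seed)
    mu_s = np.asarray(mu_student, dtype=float)
    mu_t = np.asarray(mu_teacher, dtype=float)
    # Draw samples from the student transition.
    x = mu_s + sigma * rng.standard_normal((n_samples, mu_s.size))
    # log p_student(x) - log p_teacher(x); normalizers cancel because sigma is shared.
    log_ratio = (
        np.sum((x - mu_t) ** 2, axis=1) - np.sum((x - mu_s) ** 2, axis=1)
    ) / (2.0 * sigma ** 2)
    return float(log_ratio.mean())


if __name__ == "__main__":
    mu_s = [0.5, -1.0, 2.0]   # hypothetical student mean at one step
    mu_t = [0.0, 0.0, 2.5]    # hypothetical teacher mean at the same state
    sigma = 0.7               # hypothetical shared step noise level
    print("closed form:", gaussian_kl_shared_sigma(mu_s, mu_t, sigma))
    print("monte carlo:", monte_carlo_kl(mu_s, mu_t, sigma))
```

Under this assumption the closed form needs no sampling at all, whereas the sample-based estimate fluctuates from seed to seed, which is one intuition for a lower-variance gradient. The unweighted squared mean difference also stays well defined as sigma goes to zero, which may be how a mean-matching view covers the deterministic ODE case; that reading is ours and is not confirmed by the abstract.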