DiffusionOPD: A Unified View of On-Policy Distillation in Diffusion Models

Abstract

Reinforcement learning (RL) has emerged as a powerful tool for improving diffusion-based text-to-image models, but existing methods are largely limited to single-task optimization. Extending RL to multiple tasks is challenging: joint optimization suffers from cross-task interference and imbalance, while cascade RL is cumbersome and prone to catastrophic forgetting. We propose DiffusionOPD, a new multi-task training paradigm for diffusion models based on Online Policy Distillation (OPD). DiffusionOPD first trains task-specific teachers independently, then distills their capabilities into a unified student along the student's own rollout trajectories. This decouples single-task exploration from multi-task integration and avoids the optimization burden of solving all tasks jointly from scratch. Theoretically, we lift the OPD framework from discrete tokens to continuous-state Markov processes and derive a closed-form per-step KL objective that unifies stochastic SDE and deterministic ODE refinement via mean-matching. We show, both formally and empirically, that this analytic gradient has lower variance and better generality than conventional PPO-style policy gradients. Extensive experiments show that DiffusionOPD consistently surpasses both multi-reward RL and cascade RL baselines in training efficiency and final performance, and achieves state-of-the-art results on all evaluated benchmarks.
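To make the mean-matching idea concrete, here is a minimal sketch, not the paper's implementation. It assumes that each denoising step of both student and teacher is an isotropic Gaussian transition with a shared standard deviation `sigma`; the abstract does not state the exact form, so this is an illustrative assumption. Under it, the per-step KL divergence reduces to a scaled squared difference of the two predicted means. The function names `per_step_gaussian_kl` and `mean_matching_loss` are hypothetical.

```python
import numpy as np


def per_step_gaussian_kl(mu_student, mu_teacher, sigma):
    """KL( N(mu_student, sigma^2 I) || N(mu_teacher, sigma^2 I) ) per sample.

    Assumption (not from the paper): both transitions are isotropic
    Gaussians that share the same standard deviation `sigma`, so the
    KL collapses to ||mu_student - mu_teacher||^2 / (2 * sigma^2).
    """
    diff = np.asarray(mu_student, dtype=float) - np.asarray(mu_teacher, dtype=float)
    # Sum squared differences over every non-batch dimension.
    sq_dist = (diff ** 2).reshape(diff.shape[0], -1).sum(axis=1)
    return sq_dist / (2.0 * sigma ** 2)


def mean_matching_loss(mu_student, mu_teacher):
    """Sigma-free surrogate: mean squared difference of predicted means.

    Hypothetical helper for deterministic (ODE-like) steps, where the
    noise scale vanishes and the KL above is no longer finite.
    """
    diff = np.asarray(mu_student, dtype=float) - np.asarray(mu_teacher, dtype=float)
    return (diff ** 2).reshape(diff.shape[0], -1).mean(axis=1)


# Toy batch: 2 samples, 3 dimensions, means differ by 1 in every coordinate.
mu_s = np.zeros((2, 3))
mu_t = np.ones((2, 3))
kl = per_step_gaussian_kl(mu_s, mu_t, sigma=0.5)   # 3 / (2 * 0.25) = 6 per sample
mse = mean_matching_loss(mu_s, mu_t)               # 1 per sample
```

In this sketch the two losses differ only by a scale factor that depends on `sigma` and the dimensionality, which is one way to read the claim that a single mean-matching objective can cover both the stochastic and the deterministic case. In an actual training loop, `mu_student` and `mu_teacher` would be evaluated at states sampled from the student's own rollouts.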