Reinforcing Diffusion Models by Direct Group Preference Optimization

Abstract

While reinforcement learning methods such as Group Relative Policy Optimization (GRPO) have significantly improved Large Language Models, adapting them to diffusion models remains challenging. In particular, GRPO requires a stochastic policy, yet the most cost-effective diffusion samplers are based on deterministic ODEs. Recent work addresses this mismatch by using inefficient SDE-based samplers to induce stochasticity, but the reliance on model-agnostic Gaussian noise leads to slow convergence. To resolve this conflict, we propose Direct Group Preference Optimization (DGPO), a new online RL algorithm that dispenses with the policy-gradient framework entirely. DGPO learns directly from group-level preferences, which exploit the relative information among samples within a group. This design eliminates the need for inefficient stochastic policies, enabling the use of efficient deterministic ODE samplers and faster training. Extensive results show that DGPO trains around 20 times faster than existing state-of-the-art methods and achieves superior performance on both in-domain and out-of-domain reward metrics. Code is available at https://github.com/Luo-Yihong/DGPO.
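As a rough illustration of what "relative information of samples within groups" can mean, the sketch below standardizes reward scores within one group of samples generated for the same prompt and splits the group into preferred and dispreferred subsets. It is not the DGPO objective itself: the function names, the mean/standard-deviation normalization, and the zero threshold are assumptions made for illustration only.

```python
from statistics import mean, pstdev


def group_relative_scores(rewards, eps=1e-8):
    """Standardize rewards within a single group (hypothetical helper).

    Each score says how a sample compares with the other samples in the
    same group, not how good it is in absolute terms.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


def split_group(samples, rewards):
    """Partition a group into preferred / dispreferred subsets.

    Assumption: a sample counts as preferred when its group-relative
    score is strictly positive, i.e. its reward is above the group mean.
    """
    scores = group_relative_scores(rewards)
    preferred = [s for s, a in zip(samples, scores) if a > 0]
    dispreferred = [s for s, a in zip(samples, scores) if a <= 0]
    return preferred, dispreferred, scores


if __name__ == "__main__":
    # Four hypothetical samples for one prompt, with made-up reward values.
    samples = ["img_a", "img_b", "img_c", "img_d"]
    rewards = [1.0, 2.0, 3.0, 4.0]
    preferred, dispreferred, scores = split_group(samples, rewards)
    print("preferred:", preferred)
    print("dispreferred:", dispreferred)
    print("scores:", [round(a, 3) for a in scores])
```

Because the scores depend only on comparisons inside the group, they can be computed from samples drawn with any sampler, including a deterministic one. How such group-level preferences enter the training loss is specified by the paper and its released code, not by this sketch.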
