DanceGRPO: Unleashing GRPO on Visual Generation

Abstract

Recent breakthroughs in generative models, particularly diffusion models and rectified flows, have revolutionized visual content creation, yet aligning model outputs with human preferences remains a critical challenge. Existing reinforcement learning (RL)-based methods for visual generation face three major limitations: incompatibility with modern ordinary differential equation (ODE)-based sampling paradigms, instability in large-scale training, and a lack of validation for video generation. This paper introduces DanceGRPO, the first unified framework to adapt Group Relative Policy Optimization (GRPO) to visual generation paradigms, applying one unified RL algorithm across two generative paradigms (diffusion models and rectified flows), three tasks (text-to-image, text-to-video, image-to-video), four foundation models (Stable Diffusion, HunyuanVideo, FLUX, SkyReel-I2V), and five reward models (image/video aesthetics, text-image alignment, video motion quality, and binary reward). To our knowledge, DanceGRPO is the first RL-based unified framework capable of seamless adaptation across diverse generative paradigms, tasks, foundation models, and reward models. DanceGRPO delivers consistent and substantial improvements, outperforming baselines by up to 181% on benchmarks such as HPS-v2.1, CLIP Score, VideoAlign, and GenEval. Notably, DanceGRPO not only stabilizes policy optimization for complex video generation, but also enables the generative policy to better capture denoising trajectories for Best-of-N inference scaling and to learn from sparse binary feedback. Our results establish DanceGRPO as a robust and versatile solution for scaling Reinforcement Learning from Human Feedback (RLHF) tasks in visual generation, offering new insights into harmonizing reinforcement learning and visual synthesis. The code will be released.
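The abstract does not spell out the GRPO formulation, so the following is only a minimal sketch of the group-relative advantage normalization that GRPO is commonly described with, not DanceGRPO's actual implementation. It assumes several samples are generated for the same prompt, each scored by a reward model, and that rewards are normalized by the group's mean and standard deviation. The function name `group_relative_advantages`, the use of the population standard deviation, the `eps` stabilizer, and the example reward values are all illustrative assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples from the same prompt.

    Assumption: advantage = (reward - group mean) / (group std + eps).
    The population standard deviation and eps are illustrative choices.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical reward-model scores for four samples of one prompt.
rewards = [0.2, 0.4, 0.6, 0.8]
advantages = group_relative_advantages(rewards)

# Hypothetical sparse binary feedback: only one sample in the group succeeds.
binary_advantages = group_relative_advantages([0.0, 0.0, 1.0, 0.0])
```

Under this normalization, samples that score above their group's average receive a positive advantage and the rest a negative one, which is one way a sparse binary reward can still yield a usable learning signal whenever outcomes differ within a group. If every sample in a group receives the same reward, all advantages are zero and that group contributes no gradient.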
