

DanceGRPO: Unleashing GRPO on Visual Generation

May 12, 2025
Authors: Zeyue Xue, Jie Wu, Yu Gao, Fangyuan Kong, Lingting Zhu, Mengzhao Chen, Zhiheng Liu, Wei Liu, Qiushan Guo, Weilin Huang, Ping Luo
cs.AI

Abstract

Recent breakthroughs in generative models, particularly diffusion models and rectified flows, have revolutionized visual content creation, yet aligning model outputs with human preferences remains a critical challenge. Existing reinforcement learning (RL)-based methods for visual generation face several key limitations: incompatibility with modern ordinary differential equation (ODE)-based sampling paradigms, instability in large-scale training, and a lack of validation for video generation. This paper introduces DanceGRPO, the first unified framework to adapt Group Relative Policy Optimization (GRPO) to visual generation paradigms, unleashing one unified RL algorithm across two generative paradigms (diffusion models and rectified flows), three tasks (text-to-image, text-to-video, image-to-video), four foundation models (Stable Diffusion, HunyuanVideo, FLUX, SkyReel-I2V), and five reward models (image/video aesthetics, text-image alignment, video motion quality, and binary reward). To our knowledge, DanceGRPO is the first RL-based unified framework capable of seamless adaptation across diverse generative paradigms, tasks, foundation models, and reward models. DanceGRPO demonstrates consistent and substantial improvements, outperforming baselines by up to 181% on benchmarks such as HPS-v2.1, CLIP Score, VideoAlign, and GenEval. Notably, DanceGRPO not only stabilizes policy optimization for complex video generation but also enables the generative policy to better capture denoising trajectories for Best-of-N inference scaling and to learn from sparse binary feedback. Our results establish DanceGRPO as a robust and versatile solution for scaling Reinforcement Learning from Human Feedback (RLHF) tasks in visual generation, offering new insights into harmonizing reinforcement learning and visual synthesis. The code will be released.
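To make the group-relative idea behind GRPO concrete, the sketch below shows how per-prompt groups of generations can be scored by a reward model and converted into advantages by normalizing within each group. This is a minimal illustration under assumed shapes and names (the function `group_relative_advantages` and the tensor layout are hypothetical), not the paper's released implementation, which additionally handles denoising-trajectory sampling and the multiple reward models listed in the abstract.

```python
import torch


def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Compute group-relative advantages from reward-model scores.

    rewards: tensor of shape (num_prompts, group_size), one row per prompt,
             holding scores for a group of samples generated from that prompt.
    Returns a tensor of the same shape, centered and scaled within each group,
    so samples scoring above their group mean receive positive advantages.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)


# Illustrative usage: 2 prompts, 4 sampled generations each, scored by a reward model.
scores = torch.tensor([[0.71, 0.55, 0.80, 0.62],
                       [0.30, 0.45, 0.28, 0.50]])
advantages = group_relative_advantages(scores)
print(advantages)  # above-average samples in each group get positive advantages
```

In a GRPO-style update, these advantages would weight the policy-gradient term for each sampled trajectory, so no separate value network is needed; the group itself serves as the baseline.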

