
DanceGRPO: Unleashing GRPO on Visual Generation

May 12, 2025
Authors: Zeyue Xue, Jie Wu, Yu Gao, Fangyuan Kong, Lingting Zhu, Mengzhao Chen, Zhiheng Liu, Wei Liu, Qiushan Guo, Weilin Huang, Ping Luo
cs.AI

Abstract

Recent breakthroughs in generative models, particularly diffusion models and rectified flows, have revolutionized visual content creation, yet aligning model outputs with human preferences remains a critical challenge. Existing reinforcement learning (RL)-based methods for visual generation face critical limitations: incompatibility with modern ordinary differential equation (ODE)-based sampling paradigms, instability in large-scale training, and a lack of validation for video generation. This paper introduces DanceGRPO, the first unified framework to adapt Group Relative Policy Optimization (GRPO) to visual generation paradigms, unleashing one unified RL algorithm across two generative paradigms (diffusion models and rectified flows), three tasks (text-to-image, text-to-video, image-to-video), four foundation models (Stable Diffusion, HunyuanVideo, FLUX, SkyReel-I2V), and five reward models (image/video aesthetics, text-image alignment, video motion quality, and binary reward). To our knowledge, DanceGRPO is the first RL-based unified framework capable of seamless adaptation across diverse generative paradigms, tasks, foundation models, and reward models. DanceGRPO demonstrates consistent and substantial improvements, outperforming baselines by up to 181% on benchmarks such as HPS-v2.1, CLIP Score, VideoAlign, and GenEval. Notably, DanceGRPO not only stabilizes policy optimization for complex video generation, but also enables the generative policy to better capture denoising trajectories for Best-of-N inference scaling and to learn from sparse binary feedback. Our results establish DanceGRPO as a robust and versatile solution for scaling Reinforcement Learning from Human Feedback (RLHF) tasks in visual generation, offering new insights into harmonizing reinforcement learning and visual synthesis. The code will be released.
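
The central mechanism GRPO brings to this setting is scoring a group of samples generated from the same prompt with a reward model and computing each sample's advantage relative to its own group, rather than relying on a learned value critic. The sketch below shows only that group-relative advantage normalization in PyTorch; the function name, tensor shapes, and toy reward values are illustrative assumptions and are not taken from the paper's implementation.

import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Standardize rewards within each group of samples from the same prompt.

    rewards: tensor of shape (num_prompts, group_size), one scalar reward per
    sampled image or video. Each sample's advantage is its reward normalized
    against the mean and standard deviation of its group, the core idea of
    group-relative policy optimization (GRPO).
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Toy usage: 2 prompts, 4 samples each, rewards from an arbitrary reward model.
rewards = torch.tensor([[0.72, 0.55, 0.81, 0.60],
                        [0.30, 0.45, 0.28, 0.50]])
print(group_relative_advantages(rewards))

Because the normalization is computed per group, the resulting advantages are on a comparable scale regardless of the absolute range of the reward model, which is one reason a single RL recipe can plausibly be shared across the image and video reward models listed above.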
