Flash-GRPO: Efficient Alignment of Video Diffusion via Single-Step Policy Optimization

Abstract

Group Relative Policy Optimization (GRPO) has emerged as an essential technique for aligning video diffusion models with human preferences, but it faces a critical computational bottleneck: training a 14B-parameter model typically demands hundreds of GPU-days per experiment. Existing efficiency methods reduce cost by subsampling training timesteps with a sliding window, but this fundamentally compromises optimization: training becomes severely unstable and fails to reach full-trajectory performance. We present Flash-GRPO, a single-step training framework that surpasses full-trajectory training in alignment quality under low computational budgets while substantially improving training efficiency. Flash-GRPO addresses two critical challenges. First, iso-temporal grouping eliminates timestep-confounded variance by enforcing temporal consistency within each prompt's group, decoupling policy performance from timestep difficulty. Second, temporal gradient rectification neutralizes the time-dependent scaling factor that causes vastly inconsistent gradient magnitudes across timesteps. Experiments on models from 1.3B to 14B parameters validate Flash-GRPO's effectiveness, demonstrating substantial training acceleration, consistent stability, and state-of-the-art alignment quality.
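
As a rough illustration of the two components named above, the following minimal Python sketch shows one way they could fit together. It is not the authors' implementation. The abstract specifies neither the timestep sampler nor the form of the time-dependent scaling factor, so the uniform sampler, the function names, and in particular `assumed_gradient_scale` are hypothetical placeholders chosen only to make the example runnable.

```python
import random
import statistics


def sample_iso_temporal_timesteps(num_prompts, group_size, num_steps, rng):
    """Draw ONE timestep per prompt and share it across that prompt's group.

    Assumption: uniform sampling over timesteps; the abstract does not
    specify the sampling distribution.
    """
    shared = [rng.randrange(num_steps) for _ in range(num_prompts)]
    return [[t] * group_size for t in shared]


def group_relative_advantages(group_rewards, eps=1e-8):
    """GRPO-style advantage: standardize rewards within one prompt's group."""
    mean = statistics.fmean(group_rewards)
    std = statistics.pstdev(group_rewards)
    return [(r - mean) / (std + eps) for r in group_rewards]


def assumed_gradient_scale(t, num_steps):
    """Hypothetical stand-in for the time-dependent scaling factor.

    The real factor depends on the sampler and noise schedule, which the
    abstract does not give; this linear form is purely illustrative.
    """
    return 1.0 + t / num_steps


def rectified_weight(t, num_steps):
    """Per-sample loss weight that divides out the assumed scale."""
    return 1.0 / assumed_gradient_scale(t, num_steps)


rng = random.Random(0)
NUM_PROMPTS, GROUP_SIZE, NUM_STEPS = 4, 8, 50

timesteps = sample_iso_temporal_timesteps(NUM_PROMPTS, GROUP_SIZE, NUM_STEPS, rng)
# Synthetic rewards stand in for reward-model scores of generated videos.
rewards = [[rng.gauss(0.0, 1.0) for _ in range(GROUP_SIZE)]
           for _ in range(NUM_PROMPTS)]
advantages = [group_relative_advantages(group) for group in rewards]
weights = [[rectified_weight(t, NUM_STEPS) for t in row] for row in timesteps]
```

Because every sample in a group is trained at the same timestep, reward differences within the group cannot be attributed to timestep difficulty, which is the decoupling the abstract describes. The weights then equalize the (assumed) gradient magnitude across prompts whose groups landed on different timesteps.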