Flash-GRPO: Efficient Alignment of Video Diffusion via One-Step Policy Optimization

Abstract

Group Relative Policy Optimization (GRPO) has emerged as an essential technique for aligning video diffusion models with human preferences, but it faces a critical computational bottleneck: training a 14B-parameter model typically demands hundreds of GPU-days per experiment. Existing efficiency methods reduce cost by subsampling training timesteps with a sliding window, but doing so fundamentally compromises optimization, leading to severe instability and a failure to reach full-trajectory performance. We present Flash-GRPO, a single-step training framework that outperforms full-trajectory training in alignment quality under low computational budgets while substantially improving training efficiency. Flash-GRPO addresses two critical challenges. First, iso-temporal grouping eliminates timestep-confounded variance by enforcing prompt-wise temporal consistency, which decouples policy performance from timestep difficulty. Second, temporal gradient rectification neutralizes the time-dependent scaling factor that causes vastly inconsistent gradient magnitudes across timesteps. Experiments on models ranging from 1.3B to 14B parameters validate Flash-GRPO's effectiveness, demonstrating substantial training acceleration, consistent stability, and state-of-the-art alignment quality.
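
As a rough illustration of the iso-temporal grouping idea, the sketch below draws one timestep per prompt, shares it across every sample in that prompt's group, and then normalizes rewards within the group in the usual group-relative fashion. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `iso_temporal_advantages`, the uniform timestep sampling, the precomputed placeholder rewards, and the `1e-8` stabilizer are all hypothetical choices made for illustration. Temporal gradient rectification is not shown, since the abstract does not specify the scaling factor.

```python
import random
import statistics

def iso_temporal_advantages(rewards_by_prompt, num_timesteps, rng):
    """Hypothetical sketch: one shared timestep per prompt group, plus
    group-normalized (GRPO-style) advantages computed within that group."""
    result = {}
    for prompt, rewards in rewards_by_prompt.items():
        # Assumption: the timestep is drawn uniformly and reused for every
        # sample of this prompt, so samples in a group differ only in their
        # outcome, not in timestep difficulty.
        t = rng.randrange(num_timesteps)
        mean = statistics.fmean(rewards)
        std = statistics.pstdev(rewards)
        # Group-relative advantage: reward centered and scaled within the group.
        advantages = [(r - mean) / (std + 1e-8) for r in rewards]
        result[prompt] = (t, advantages)
    return result

# Usage with placeholder rewards for two prompts.
rng = random.Random(0)
groups = {"a cat surfing": [1.0, 2.0, 3.0], "a city at night": [0.5, 0.5]}
for prompt, (t, adv) in iso_temporal_advantages(groups, 50, rng).items():
    print(prompt, t, [round(a, 3) for a in adv])
```

Because every sample in a group shares the same timestep, differences in advantage within the group cannot be attributed to some samples having been trained at an easier or harder timestep than others.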