Flash-GRPO: Efficient Alignment of Video Diffusion via One-Step Policy Optimization

Abstract

Group Relative Policy Optimization (GRPO) has emerged as an essential method for aligning video diffusion models with human preferences, but it faces a critical computational bottleneck: training a 14B-parameter model typically demands hundreds of GPU-days per experiment. Existing efficiency methods reduce cost by subsampling training timesteps with a sliding window, but this fundamentally compromises optimization: they exhibit severe instability and fail to reach full-trajectory performance. We present Flash-GRPO, a single-step training framework that outperforms full-trajectory training in alignment quality under low computational budgets while substantially improving training efficiency. Flash-GRPO addresses two critical challenges. Iso-temporal grouping eliminates timestep-confounded variance by enforcing prompt-wise temporal consistency, decoupling policy performance from timestep difficulty. Temporal gradient rectification neutralizes the time-dependent scaling factor that causes vastly inconsistent gradient magnitudes across timesteps. Experiments on models from 1.3B to 14B parameters validate Flash-GRPO's effectiveness, demonstrating substantial training acceleration with consistent stability and state-of-the-art alignment quality.
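
As a rough illustration of the grouping idea, the sketch below assumes that "prompt-wise temporal consistency" means every sample in a prompt's group is trained at the same single timestep, so that reward differences within the group are not confounded by timestep difficulty. The function names, the uniform timestep sampler, and the placeholder rewards are hypothetical and are not taken from the paper; the advantage computation is the usual group-wise reward normalization used in GRPO. Temporal gradient rectification is not shown, since the abstract does not specify the scaling factor.

```python
import random
import statistics


def sample_iso_temporal_timesteps(num_prompts, group_size, num_timesteps, rng):
    """Draw one timestep per prompt and share it across that prompt's group.

    Assumption: uniform sampling over timestep indices; the actual sampler
    used by Flash-GRPO is not stated in the abstract.
    """
    shared = [rng.randrange(num_timesteps) for _ in range(num_prompts)]
    return [[t] * group_size for t in shared]


def group_relative_advantages(group_rewards, eps=1e-8):
    """Center and scale rewards within a single prompt's group."""
    mean = statistics.fmean(group_rewards)
    std = statistics.pstdev(group_rewards)
    return [(r - mean) / (std + eps) for r in group_rewards]


rng = random.Random(0)
timesteps = sample_iso_temporal_timesteps(
    num_prompts=4, group_size=8, num_timesteps=50, rng=rng
)

# Placeholder rewards standing in for a reward model's scores.
rewards = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(4)]
advantages = [group_relative_advantages(group) for group in rewards]
```

Because each group shares one timestep, the normalized advantages compare samples only against peers evaluated at the same noise level, which is the decoupling the abstract describes.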