Flash-GRPO: Efficient Alignment of Video Diffusion via Single-Step Policy Optimization

Abstract

Group Relative Policy Optimization (GRPO) has emerged as an essential method for aligning video diffusion models with human preferences, but it faces a critical computational bottleneck: training a 14B-parameter model typically demands hundreds of GPU-days per experiment. Existing efficiency methods cut costs by subsampling training timesteps with a sliding window, but this fundamentally compromises optimization: training becomes severely unstable and fails to reach full-trajectory performance. We present Flash-GRPO, a single-step training framework that outperforms full-trajectory training in alignment quality under low computational budgets while substantially improving training efficiency. Flash-GRPO addresses two critical challenges. First, iso-temporal grouping eliminates timestep-confounded variance by enforcing prompt-wise temporal consistency, decoupling policy performance from timestep difficulty. Second, temporal gradient rectification neutralizes the time-dependent scaling factor that causes vastly inconsistent gradient magnitudes across timesteps. Experiments on models ranging from 1.3B to 14B parameters validate Flash-GRPO's effectiveness, demonstrating substantial training acceleration with consistent stability and state-of-the-art alignment quality.
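
The two mechanisms named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and it rests on assumptions the abstract does not confirm. It reads "prompt-wise temporal consistency" as every sample in a prompt's group being trained at the same timestep, so that the group-normalized advantage compares samples at equal timestep difficulty. It treats rectification as dividing out a known time-dependent scale. All function names are hypothetical, and `toy_scale` is a made-up stand-in: the abstract does not give the actual scaling factor.

```python
import random
from statistics import mean, pstdev


def assign_iso_temporal_timesteps(prompts, group_size, num_timesteps, rng):
    """Draw ONE training timestep per prompt and share it across that
    prompt's whole group (assumed reading of iso-temporal grouping)."""
    plan = {}
    for prompt in prompts:
        t = rng.randrange(num_timesteps)   # single draw per prompt
        plan[prompt] = [t] * group_size    # every group member uses the same t
    return plan


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: standardize rewards within one prompt's group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def toy_scale(t):
    """Made-up time-dependent scale, for illustration only."""
    return 1.0 + t


def rectified_weight(t, scale_fn, eps=1e-8):
    """Hypothetical rectification: weight a timestep's loss by the inverse
    of its scale so gradient magnitudes are comparable across timesteps."""
    return 1.0 / (scale_fn(t) + eps)


if __name__ == "__main__":
    rng = random.Random(0)
    plan = assign_iso_temporal_timesteps(["a cat surfing", "a city at night"], 4, 50, rng)
    for prompt, timesteps in plan.items():
        print(prompt, timesteps)
    print(group_relative_advantages([0.2, 0.5, 0.9, 0.4]))
    print(rectified_weight(10, toy_scale) * toy_scale(10))
```

Because each group shares one timestep, differences in reward within a group cannot come from some samples having been trained at easier or harder timesteps, which is the confound the abstract says iso-temporal grouping removes.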