Manifold-Aware Exploration for Reinforcement Learning in Video Generation

Abstract

Group Relative Policy Optimization (GRPO) methods for video generation, such as FlowGRPO, remain far less reliable than their counterparts for language models and image generation. This gap arises because video generation has a complex solution space, and the ODE-to-SDE conversion used for exploration can inject excess noise, which lowers rollout quality, makes reward estimates less reliable, and thereby destabilizes post-training alignment. To address this problem, we view the pre-trained model as defining a valid video data manifold and formulate the core problem as constraining exploration to the vicinity of this manifold, so that rollout quality is preserved and reward estimates remain reliable. We propose SAGE-GRPO (Stable Alignment via Exploration), which applies constraints at both micro and macro levels. At the micro level, we derive a precise manifold-aware SDE with a logarithmic curvature correction and introduce a gradient norm equalizer to stabilize sampling and updates across timesteps. At the macro level, we use a dual trust region with a periodic moving anchor and stepwise constraints, so that the trust region tracks checkpoints that are closer to the manifold and limits long-horizon drift. We evaluate SAGE-GRPO on HunyuanVideo1.5 using the original VideoAlign as the reward model and observe consistent gains over previous methods in VQ, MQ, TA, and visual metrics (CLIPScore, PickScore), demonstrating superior performance in both reward maximization and overall video quality. The code and visual gallery are available at https://dungeonmassster.github.io/SAGE-GRPO-Page/.
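
For readers unfamiliar with GRPO, the group-relative advantage that gives the method its name can be sketched as follows. This is a minimal illustration of standard GRPO-style reward normalization, not the SAGE-GRPO implementation; the function name `group_relative_advantages`, the `eps` constant, and the example reward values are assumptions introduced here for illustration only.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of rollouts for the same prompt.

    Hypothetical helper for illustration: standard GRPO-style normalization,
    not code from the SAGE-GRPO release.
    """
    if not rewards:
        raise ValueError("rewards must contain at least one rollout score")
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation over the group
    # eps keeps the division finite when every rollout gets the same score
    return [(r - mu) / (sigma + eps) for r in rewards]


# Made-up reward-model scores for four rollouts of one prompt
advantages = group_relative_advantages([0.2, 0.5, 0.9, 0.4])
```

Because each advantage depends on the mean and spread of the whole group, a few low-quality rollouts or noisy reward estimates shift the training signal for every sample in that group, which is one way to read the abstract's concern about rollout quality.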
