Manifold-Aware Exploration for Reinforcement Learning in Video Generation
March 23, 2026
Authors: Mingzhe Zheng, Weijie Kong, Yue Wu, Dengyang Jiang, Yue Ma, Xuanhua He, Bin Lin, Kaixiong Gong, Zhao Zhong, Liefeng Bo, Qifeng Chen, Harry Yang
cs.AI
Abstract
Group Relative Policy Optimization (GRPO) methods for video generation, such as FlowGRPO, remain far less reliable than their counterparts in language modeling and image generation. This gap arises because video generation has a complex solution space, and the ODE-to-SDE conversion used for exploration can inject excess noise, lowering rollout quality and making reward estimates less reliable, which in turn destabilizes post-training alignment. To address this, we view the pre-trained model as defining a valid video data manifold and formulate the core problem as constraining exploration to the vicinity of this manifold, so that rollout quality is preserved and reward estimates remain reliable. We propose SAGE-GRPO (Stable Alignment via Exploration), which applies constraints at both the micro and macro levels. At the micro level, we derive a precise manifold-aware SDE with a logarithmic curvature correction and introduce a gradient norm equalizer to stabilize sampling and updates across timesteps. At the macro level, we use a dual trust region with a periodically moving anchor and stepwise constraints, so that the trust region tracks checkpoints closer to the manifold and limits long-horizon drift. We evaluate SAGE-GRPO on HunyuanVideo1.5 with the original VideoAlign as the reward model and observe consistent gains over prior methods on VQ, MQ, TA, and visual metrics (CLIPScore, PickScore), demonstrating superior performance in both reward maximization and overall video quality. The code and a visual gallery are available at https://dungeonmassster.github.io/SAGE-GRPO-Page/.
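As background for the ODE-to-SDE conversion discussed above, the sketch below contrasts a deterministic Euler step of a flow-matching ODE with a stochastic Euler–Maruyama step that keeps the same drift but injects exploration noise. The velocity field `velocity`, the noise scale `sigma`, and the step size are illustrative assumptions only; the paper's actual SDE additionally carries a manifold-aware logarithmic curvature correction that is not reproduced here.

```python
import numpy as np

def ode_step(x, t, dt, velocity):
    # Deterministic Euler step of the probability-flow ODE: dx = v(x, t) dt.
    return x + velocity(x, t) * dt

def sde_step(x, t, dt, velocity, sigma, rng):
    # Euler-Maruyama step of a matching SDE: dx = v(x, t) dt + sigma dW.
    # The Brownian increment dW ~ N(0, dt) supplies the stochasticity that
    # GRPO-style methods need to produce diverse rollouts for reward ranking.
    noise = rng.standard_normal(x.shape) * np.sqrt(dt)
    return x + velocity(x, t) * dt + sigma * noise

# Toy linear velocity field as a stand-in for the learned video model.
v = lambda x, t: -x
rng = np.random.default_rng(0)
x0 = np.ones(4)
x_ode = ode_step(x0, t=0.5, dt=0.01, velocity=v)
x_sde = sde_step(x0, t=0.5, dt=0.01, velocity=v, sigma=0.1, rng=rng)
```

With `sigma = 0` the stochastic step reduces exactly to the deterministic one; the abstract's argument is that an overly large noise scale pushes samples off the data manifold, which is what degrades rollout quality and reward estimates.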