This paper proposes a reinforcement-learning exploration strategy for video generation that improves exploration efficiency by perceiving the geometric structure of the data manifold. Conventional reinforcement-learning methods often suffer from inefficient exploration in high-dimensional video spaces; this work learns the intrinsic manifold structure of video data to guide the agent toward more semantically meaningful exploration directions. Specifically, the authors construct a manifold-aware reward function that combines the semantic continuity of video content with visual-quality metrics, allowing the agent to explore diverse content variations while preserving the coherence of the generated videos. Experiments show that, compared with baselines based on random or heuristic exploration, the method achieves significant gains in video quality, diversity, and training stability, offering a new perspective on the exploration challenge in high-dimensional media generation tasks.
Manifold-Aware Exploration for Reinforcement Learning in Video Generation
March 23, 2026
Authors: Mingzhe Zheng, Weijie Kong, Yue Wu, Dengyang Jiang, Yue Ma, Xuanhua He, Bin Lin, Kaixiong Gong, Zhao Zhong, Liefeng Bo, Qifeng Chen, Harry Yang
cs.AI
Abstract
Group Relative Policy Optimization (GRPO) methods for video generation like FlowGRPO remain far less reliable than their counterparts for language models and images. This gap arises because video generation has a complex solution space, and the ODE-to-SDE conversion used for exploration can inject excess noise, lowering rollout quality and making reward estimates less reliable, which destabilizes post-training alignment. To address this problem, we view the pre-trained model as defining a valid video data manifold and formulate the core problem as constraining exploration within the vicinity of this manifold, ensuring that rollout quality is preserved and reward estimates remain reliable. We propose SAGE-GRPO (Stable Alignment via Exploration), which applies constraints at both micro and macro levels. At the micro level, we derive a precise manifold-aware SDE with a logarithmic curvature correction and introduce a gradient norm equalizer to stabilize sampling and updates across timesteps. At the macro level, we use a dual trust region with a periodic moving anchor and stepwise constraints so that the trust region tracks checkpoints that are closer to the manifold and limits long-horizon drift. We evaluate SAGE-GRPO on HunyuanVideo1.5 using the original VideoAlign as the reward model and observe consistent gains over previous methods in VQ, MQ, TA, and visual metrics (CLIPScore, PickScore), demonstrating superior performance in both reward maximization and overall video quality. The code and visual gallery are available at https://dungeonmassster.github.io/SAGE-GRPO-Page/.
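The abstract does not give the exact loss, but the macro-level idea of a dual trust region can be sketched as two PPO/GRPO-style clipped ratios: a looser one against a periodically refreshed anchor policy (limiting long-horizon drift from a checkpoint near the manifold) and a tighter stepwise one against the previous update's policy. The function below is a minimal illustration under that assumption; the function name, clipping ranges, and the max-combination of the two terms are hypothetical, not the paper's actual objective.

```python
import numpy as np

def dual_trust_region_loss(logp_new, logp_anchor, logp_prev, advantages,
                           eps_anchor=0.2, eps_step=0.1):
    """Hypothetical clipped surrogate with two trust regions:
    one against a periodically moved anchor policy, and a tighter
    stepwise one against the previous update's policy."""
    r_anchor = np.exp(logp_new - logp_anchor)  # ratio to the moving anchor
    r_step = np.exp(logp_new - logp_prev)      # ratio to the previous step
    # PPO-style clipped objective for each trust region (negated: a loss)
    loss_anchor = -np.minimum(
        r_anchor * advantages,
        np.clip(r_anchor, 1 - eps_anchor, 1 + eps_anchor) * advantages)
    loss_step = -np.minimum(
        r_step * advantages,
        np.clip(r_step, 1 - eps_step, 1 + eps_step) * advantages)
    # Taking the elementwise max enforces whichever constraint is tighter:
    # the stepwise term limits per-update drift, the anchor term limits
    # long-horizon drift away from a checkpoint close to the data manifold.
    return float(np.mean(np.maximum(loss_anchor, loss_step)))
```

In this sketch, refreshing `logp_anchor` every K updates would correspond to the "periodic moving anchor", while `logp_prev` changes every update and realizes the stepwise constraint.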