VGGRPO: Towards World-Consistent Video Generation with 4D Latent Reward
March 27, 2026
Authors: Zhaochong An, Orest Kupyn, Théo Uscidda, Andrea Colaco, Karan Ahuja, Serge Belongie, Mar Gonzalez-Franco, Marta Tintore Gazulla
cs.AI
Abstract
Large-scale video diffusion models achieve impressive visual quality, yet often fail to preserve geometric consistency. Prior approaches improve consistency either by augmenting the generator with additional modules or applying geometry-aware alignment. However, architectural modifications can compromise the generalization of internet-scale pretrained models, while existing alignment methods are limited to static scenes and rely on RGB-space rewards that require repeated VAE decoding, incurring substantial compute overhead and failing to generalize to highly dynamic real-world scenes. To preserve the pretrained capacity while improving geometric consistency, we propose VGGRPO (Visual Geometry GRPO), a latent geometry-guided framework for geometry-aware video post-training. VGGRPO introduces a Latent Geometry Model (LGM) that stitches video diffusion latents to geometry foundation models, enabling direct decoding of scene geometry from the latent space. By constructing LGM from a geometry model with 4D reconstruction capability, VGGRPO naturally extends to dynamic scenes, overcoming the static-scene limitations of prior methods. Building on this, we perform latent-space Group Relative Policy Optimization with two complementary rewards: a camera motion smoothness reward that penalizes jittery trajectories, and a geometry reprojection consistency reward that enforces cross-view geometric coherence. Experiments on both static and dynamic benchmarks show that VGGRPO improves camera stability, geometry consistency, and overall quality while eliminating costly VAE decoding, making latent-space geometry-guided reinforcement an efficient and flexible approach to world-consistent video generation.
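The group-relative step and the camera motion smoothness reward described above can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the reward formulas, so the smoothness reward is assumed here to be the negative mean squared second difference of camera centers, and the advantage is the standard GRPO normalization of each reward by its group's mean and standard deviation. The function names and the `(T, 3)` camera-center input are hypothetical. In VGGRPO the camera trajectory would come from the geometry decoded by the LGM.

```python
import numpy as np


def camera_smoothness_reward(positions):
    """Assumed smoothness reward: negative mean squared camera acceleration.

    positions: array of shape (T, 3) holding camera centers per frame
    (hypothetical input; the paper decodes geometry from latents instead).
    """
    positions = np.asarray(positions, dtype=np.float64)
    if positions.shape[0] < 3:
        return 0.0  # too few frames to measure acceleration
    # Second-order finite difference approximates per-frame acceleration.
    accel = positions[2:] - 2.0 * positions[1:-1] + positions[:-2]
    return -float(np.mean(np.sum(accel ** 2, axis=1)))


def group_relative_advantages(rewards, eps=1e-8):
    """Standard GRPO-style advantage: normalize rewards within one group."""
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)


# Toy group of two sampled trajectories for the same prompt.
t = np.arange(8, dtype=np.float64)
smooth = np.stack([t, np.zeros_like(t), np.zeros_like(t)], axis=1)
jitter = smooth.copy()
jitter[:, 1] = 0.1 * (-1.0) ** np.arange(8)  # alternating vertical shake

group_rewards = [camera_smoothness_reward(smooth),
                 camera_smoothness_reward(jitter)]
advantages = group_relative_advantages(group_rewards)
```

In this toy group the constant-velocity trajectory receives the higher advantage, so a policy update would favor it over the jittery sample. A reprojection consistency term could be added to each reward before normalization in the same way.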