VGGRPO: Towards World-Consistent Video Generation with 4D Latent Reward

Abstract

Large-scale video diffusion models achieve impressive visual quality, yet often fail to preserve geometric consistency. Prior approaches improve consistency either by augmenting the generator with additional modules or by applying geometry-aware alignment. However, architectural modifications can compromise the generalization of internet-scale pretrained models. Existing alignment methods, in turn, are limited to static scenes and rely on RGB-space rewards that require repeated VAE decoding, which incurs substantial compute overhead and fails to generalize to highly dynamic real-world scenes. To preserve the pretrained capacity while improving geometric consistency, we propose VGGRPO (Visual Geometry GRPO), a latent geometry-guided framework for geometry-aware video post-training. VGGRPO introduces a Latent Geometry Model (LGM) that stitches video diffusion latents to geometry foundation models, enabling direct decoding of scene geometry from the latent space. By constructing LGM from a geometry model with 4D reconstruction capability, VGGRPO naturally extends to dynamic scenes, overcoming the static-scene limitations of prior methods. Building on this, we perform latent-space Group Relative Policy Optimization with two complementary rewards: a camera motion smoothness reward that penalizes jittery trajectories, and a geometry reprojection consistency reward that enforces cross-view geometric coherence. Experiments on both static and dynamic benchmarks show that VGGRPO improves camera stability, geometry consistency, and overall quality while eliminating costly VAE decoding, making latent-space geometry-guided reinforcement learning an efficient and flexible approach to world-consistent video generation.
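To make the group-relative step concrete, the following is a minimal sketch and not the paper's implementation. It assumes camera centers per frame are already available (in VGGRPO they would come from the LGM; here they are hand-written toy trajectories), and it uses a hypothetical smoothness score based on squared second differences of camera position. The function names `smoothness_reward` and `group_relative_advantages` are illustrative. The reprojection consistency reward is omitted because it needs per-frame depth and poses that the abstract does not specify. Only the within-group standardization of rewards is the standard GRPO ingredient.

```python
from statistics import mean, pstdev


def smoothness_reward(positions):
    """Hypothetical smoothness score (an assumption, not the paper's reward).

    `positions` is a list of (x, y, z) camera centers, one per frame.
    Returns the negative mean squared second difference, so a
    constant-velocity path scores 0 and jitter pushes the score below 0.
    """
    if len(positions) < 3:
        return 0.0
    total = 0.0
    for prev, cur, nxt in zip(positions, positions[1:], positions[2:]):
        # Discrete acceleration per axis: p[t+1] - 2 * p[t] + p[t-1]
        total += sum((n - 2 * c + p) ** 2 for p, c, n in zip(prev, cur, nxt))
    return -total / (len(positions) - 2)


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of samples for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy group of three rollouts for one prompt (made-up trajectories).
smooth = [(float(t), 0.0, 0.0) for t in range(5)]
mild = [(float(t), 0.1 * (-1) ** t, 0.0) for t in range(5)]
jittery = [(float(t), 0.5 * (-1) ** t, 0.0) for t in range(5)]

rewards = [smoothness_reward(p) for p in (smooth, mild, jittery)]
advantages = group_relative_advantages(rewards)
```

In this toy group the smooth trajectory receives the largest advantage and the jittery one the smallest, which is the signal a policy update would then use to favor stable camera motion. A second reward term could be added to `rewards` before standardization.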